Abstract

Standard maximum margin structured prediction methods lack a straightforward probabilistic
interpretation of the learning scheme and the prediction rule. Therefore their unique
advantages, such as dual sparseness and kernel tricks, cannot be easily conjoined with the
merits of a probabilistic model, such as Bayesian regularization, model averaging, and
allowing hidden variables. In this paper, we present a novel and general framework called
Maximum Entropy Discrimination Markov Networks (MaxEnDNet), which integrates these
two approaches and combines and extends their merits. Major innovations of this model
include: 1) It generalizes the extant Markov network prediction rule based on a point
estimator of weights to a Bayesian-style estimator that integrates over a learned distribution
of the weights. 2) It extends the conventional max-entropy discrimination learning of
classification rules to a new structural max-entropy discrimination paradigm of learning the
distribution of Markov networks. 3) It subsumes the well-known and powerful Maximum
Margin Markov network (M$^3$N) as a special case, and leads to a model similar to an
$L_1$-regularized M$^3$N that is simultaneously primal and dual sparse, or to other types of
Markov networks, by plugging in different prior distributions of the weights. 4) It offers a
simple inference algorithm that combines existing variational inference and
convex-optimization based M$^3$N solvers as subroutines. 5) It offers a PAC-Bayesian style
generalization bound. This work represents the first successful attempt to combine
Bayesian-style learning (based on generative models) with structured maximum margin
learning (based on a discriminative model), and outperforms a wide array of competing
methods for structured input/output learning on both synthetic and real OCR and web data
extraction data sets.

Keywords: Maximum entropy discrimination Markov networks, Bayesian max-margin 
Markov networks, Laplace max-margin Markov networks, Structured prediction. 



1. Introduction 

Inferring structured predictions based on high-dimensional, often multi-modal and hybrid 
covariates remains a central problem in data mining (e.g., web-info extraction), machine 
intelligence (e.g., machine translation), and scientific discovery (e.g., genome annotation). 
Several recent approaches to this problem are based on learning discriminative graphical 
models defined on composite features that explicitly exploit the structured dependencies
among input elements and structured interpretational outputs. Major instances of such
models include the conditional random fields (CRFs) (Lafferty et al., 2001), Markov
networks (MNs) (Taskar et al., 2003), and other specialized graphical models (Altun et al.,
2003). Various paradigms for training such models based on different loss functions have
been explored, including the maximum conditional likelihood learning (Lafferty et al., 2001)
and the maximum margin learning (Altun et al., 2003; Taskar et al., 2003; Tsochantaridis
et al., 2004), with remarkable success.



The likelihood-based models for structured predictions are usually based on a joint
distribution of both input and output variables (Rabiner, 1989) or a conditional distribution
of the output given the input (Lafferty et al., 2001). Therefore this paradigm offers a
flexible probabilistic framework that can naturally facilitate: hidden variables that capture
latent semantics such as a generative hierarchy (Quattoni et al., 2004; Zhu et al., 2008);
Bayesian regularization that imposes desirable biases such as sparseness (Lee et al., 2006;
Wainwright et al., 2006; Andrew and Gao, 2007); and Bayesian prediction based on
combining predictions across all values of model parameters (i.e., model averaging), which can
reduce the risk of overfitting. On the other hand, the margin-based structured prediction
models leverage the maximum margin principle and convex optimization formulation
underlying the support vector machines, and concentrate directly on the input-output mapping
(Taskar et al., 2003; Altun et al., 2003; Tsochantaridis et al., 2004). In principle, this
approach can lead to a robust decision boundary due to the dual sparseness (i.e., depending on
only a few support vectors) and global optimality of the learned model. However, although
arguably a more desirable paradigm for training highly discriminative structured prediction
models in a number of application contexts, the lack of a straightforward probabilistic
interpretation of the maximum-margin models makes them unable to offer the same flexibilities
as the likelihood-based models discussed above.

For example, for domains with complex feature spaces, it is often desirable to pursue
a "sparse" representation of the model that leaves out irrelevant features. In
likelihood-based estimation, sparse model fitting has been extensively studied. A commonly used
strategy is to add an $L_1$-penalty to the likelihood function, which can also be viewed as
MAP estimation under a Laplace prior. However, little progress has been made so far on
learning sparse MNs or log-linear models in general based on the maximum margin principle.
While sparsity has been pursued in maximum margin learning of certain discriminative
models such as SVM that are "unstructured" (i.e., with a univariate output), by using
$L_1$-regularization (Bennett and Mangasarian, 1992) or by adding a cardinality constraint
(Chan et al., 2007), generalization of these techniques to structured output spaces turns out
to be extremely non-trivial, as we discuss later in this paper. There is also very little
theoretical analysis of the performance guarantees of margin-based models under direct
$L_1$-regularization. Our empirical results in this paper suggest that an $L_1$-regularized
estimation, especially the likelihood-based estimation, can be non-robust: discarding
features that are not completely irrelevant can potentially hurt generalization ability.

In this paper, we propose a general theory of maximum entropy discrimination Markov
networks (MaxEnDNet, or simply MEDN) for structured input/output learning and
prediction. This formalism offers a formal paradigm for integrating both generative and
discriminative principles and the Bayesian regularization techniques for learning structured
prediction models. It integrates the spirit of maximum margin learning from SVM, the
design of discriminative structured prediction models in maximum margin Markov networks
(M$^3$N), and the ideas of entropy regularization and model averaging in maximum entropy
discrimination methods (Jaakkola et al., 1999). It allows one to learn a distribution of
maximum margin structured prediction models that offers a wide range of important
advantages over conventional models such as M$^3$N, including more robust prediction due to an
averaging prediction function based on the learned distribution of models, Bayesian-style
regularization that can lead to a model that is simultaneously primal and dual sparse, and
allowance of hidden variables and semi-supervised learning based on partially labeled data.

While the formalism of MaxEnDNet is extremely general, our main focus and contributions
in this paper are concentrated on the following results. We formally define the
MaxEnDNet as solving a generalized entropy optimization problem subject to expected
margin constraints due to the training data, and under an arbitrary prior of feature coefficients;
and we offer a general closed-form solution to this problem. An interesting insight that
immediately follows from this general solution is that a trivial assumption on the prior
distribution of the coefficients, i.e., a standard normal, reduces the linear MaxEnDNet to the
standard M$^3$N, as shown in Theorem 3. This understanding opens the way to use different
priors for MaxEnDNet to achieve more interesting regularization effects. We show that, by
using a Laplace prior for the feature coefficients, the resulting LapMEDN is effectively an
M$^3$N that is not only dual sparse (i.e., defined by a few support vectors), but also primal
sparse (i.e., shrinkage on coefficients corresponding to irrelevant features). We develop a
novel variational approximate learning method for the LapMEDN, which leverages the
hierarchical representation of the Laplace prior (Figueiredo, 2003) and the reducibility of
MaxEnDNet to M$^3$N, and combines the variational Bayesian technique with existing convex
optimization algorithms developed for M$^3$N (Taskar et al., 2003; Bartlett et al., 2004;
Ratliff et al., 2007). We also provide a formal analysis of the generalization error of the
MaxEnDNet, and prove a novel PAC-Bayes bound on the structured prediction error by
MaxEnDNet. We performed a thorough comparison of the Laplace MaxEnDNet with competing
methods, including M$^3$N (i.e., the Gaussian MaxEnDNet), $L_1$-regularized M$^3$N (see
footnote 1), CRFs, $L_1$-regularized CRFs, and $L_2$-regularized CRFs, on both synthetic and
real structured input/output data. The Laplace MaxEnDNet exhibits mostly superior, and
sometimes comparable, performance in all scenarios tested.

The rest of the paper is structured as follows. In the next section, we review the 
basic structured prediction formalism and set the stage for our model. Section 3 presents 
the general theory of maximum entropy discrimination Markov networks and some basic 
theoretical results, followed by two instantiations of the general MaxEnDNet, the Gaussian 
MaxEnDNet and the Laplace MaxEnDNet. Section 4 offers a detailed discussion of the 
primal and dual sparsity property of Laplace MaxEnDNet. Section 5 presents a novel 
iterative learning algorithm based on variational approximation and convex optimization. 
In Section 6, we briefly discuss the generalization bound of MaxEnDNet. Then, we show 
empirical results on both synthetic and real OCR and web data extraction data sets in 
Section 7. Section 8 discusses some related work and Section 9 concludes this paper. 



1. This model has not yet been reported in the literature, and represents another new extension of the
M$^3$N, which we will present in a separate paper in detail.
2. Preliminaries



In structured prediction problems such as natural language parsing, image annotation, or
DNA decoding, one aims to learn a function $h : \mathcal{X} \to \mathcal{Y}$ that maps a
structured input $\mathbf{x} \in \mathcal{X}$, e.g., a sentence or an image, to a structured
output $\mathbf{y} \in \mathcal{Y}$, e.g., a sentence parsing or a scene annotation, where,
unlike a standard classification problem, $\mathbf{y}$ is a multivariate prediction consisting
of multiple labeling elements. Let $L$ denote the cardinality of the output, and $m_l$, where
$l = 1, \dots, L$, denote the arity of each element; then
$\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_L$ with
$\mathcal{Y}_l = \{a_1, \dots, a_{m_l}\}$ represents a combinatorial space of structured
interpretations of the multi-facet objects in the inputs. For example, $\mathcal{Y}$ could
correspond to the space of all possible instantiations of the parse trees of a sentence, or
the space of all possible ways of labeling entities over some segmentation of an image. The
prediction $\mathbf{y} = (y_1, \dots, y_L)$ is structured because each individual label
$y_l \in \mathcal{Y}_l$ within $\mathbf{y}$ must be determined in the context of the other
labels $y_{l' \neq l}$, rather than independently as in classification, in order to arrive at a
globally satisfactory and consistent prediction.

Let $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ represent a discriminant function over
the input-output pairs from which one can define the predictive function, and let
$\mathcal{H}$ denote the space of all possible $F$. A common choice of $F$ is a linear model,
$F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = g(\mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y}))$,
where $\mathbf{f} = [f_1 \dots f_K]^\top$ is a $K$-dimensional column vector of the feature
functions $f_k : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, and
$\mathbf{w} = [w_1 \dots w_K]^\top$ is the corresponding vector of the weights of the feature
functions. Typically, a structured prediction model chooses an optimal estimate
$\mathbf{w}^*$ by minimizing some loss function $J(\mathbf{w})$, and defines a predictive
function in terms of an optimization problem that maximizes $F(\,\cdot\,; \mathbf{w}^*)$ over
the response variable $\mathbf{y}$ given an input $\mathbf{x}$:

$$h_0(\mathbf{x}; \mathbf{w}^*) = \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} F(\mathbf{x}, \mathbf{y}; \mathbf{w}^*), \qquad (1)$$

where $\mathcal{Y}(\mathbf{x}) \subseteq \mathcal{Y}$ is the feasible subset of structured
labels for the input $\mathbf{x}$. Here, we assume that $\mathcal{Y}(\mathbf{x})$ is finite for
any $\mathbf{x}$.
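As a minimal sketch of the prediction rule in Eq. (1), the Python snippet below enumerates a small finite label space and returns the highest-scoring labeling under a linear discriminant function with an identity $g$. The feature map `features`, the binary label set, and the weight values are hypothetical choices made only for illustration; they are not taken from the paper, and brute-force enumeration is only feasible for toy problems.

```python
from itertools import product

def features(x, y):
    """Hypothetical feature map f(x, y) for a chain-structured output."""
    # f_1: number of positions where the label matches the input symbol
    emit = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    # f_2: number of adjacent label pairs that agree (a pairwise potential)
    trans = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [emit, trans]

def score(w, x, y):
    """Linear discriminant function F(x, y; w) = w^T f(x, y)."""
    return sum(wk * fk for wk, fk in zip(w, features(x, y)))

def predict(w, x, labels=(0, 1)):
    """Eq. (1): arg max of F over the finite feasible set Y(x), by enumeration."""
    return max(product(labels, repeat=len(x)), key=lambda y: score(w, x, y))

w_star = [1.0, 0.6]   # illustrative weights, not learned
x = (0, 0, 1, 0)
y_hat = predict(w_star, x)
print(y_hat, score(w_star, x, y_hat))
```

With these illustrative weights the pairwise feature outweighs the single mismatching position, so the predicted labeling smooths over the isolated symbol; this is the sense in which each label is decided in the context of the others.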

Depending on the specific choice of $F(\,\cdot\,; \mathbf{w})$ (e.g., linear, or log-linear),
and of the loss function $J(\mathbf{w})$ for estimating the parameter $\mathbf{w}^*$ (e.g.,
likelihood, or margin), incarnations of the general structured prediction formalism described
above can be seen in classical generative models such as the HMM (Rabiner, 1989), where
$g(\cdot)$ can be an exponential family distribution function and $J(\mathbf{w})$ is the joint
likelihood of the input and its labeling; and in recent discriminative models such as the
CRFs (Lafferty et al., 2001), where $g(\cdot)$ is a Boltzmann machine and $J(\mathbf{w})$ is
the conditional likelihood of the structured labeling given the input; and the M$^3$N (Taskar
et al., 2003), where $g(\cdot)$ is an identity function and $J(\mathbf{w})$ is the margin
between the true labeling and any other feasible labeling in $\mathcal{Y}(\mathbf{x})$. Our
approach toward a more general discriminative training is based on a maximum entropy principle
that allows an elegant combination of the discriminative maximum margin learning with
the generative Bayesian regularization and hierarchical modeling, and we consider the more
general problem of finding a distribution over $\mathcal{H}$ that enables a convex combination
of discriminant functions for robust structured prediction.

Before delving into the exposition of the proposed approach, we end this section with a
brief recapitulation of the basic M$^3$N, upon which the proposed approach is built. Under
a max-margin framework, given a set of fully observed training data
$\mathcal{D} = \{(\mathbf{x}^i, \mathbf{y}^i)\}_{i=1}^N$, we obtain a point estimate of the
weight vector $\mathbf{w}$ by solving the following max-margin problem P0 (Taskar et al., 2003):

$$\text{P0 (M}^3\text{N)}: \quad \min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t. } \forall i, \forall \mathbf{y} \neq \mathbf{y}^i: \quad \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \geq \Delta\ell_i(\mathbf{y}) - \xi_i, \quad \xi_i \geq 0,$$



where $\Delta\mathbf{f}_i(\mathbf{y}) = \mathbf{f}(\mathbf{x}^i, \mathbf{y}^i) - \mathbf{f}(\mathbf{x}^i, \mathbf{y})$
and $\Delta F_i(\mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y})$ is the
"margin" between the true label $\mathbf{y}^i$ and a prediction $\mathbf{y}$,
$\Delta\ell_i(\mathbf{y})$ is a loss function with respect to $\mathbf{y}^i$, and $\xi_i$
represents a slack variable that absorbs errors in the training data. Various loss functions
have been proposed in the literature (Tsochantaridis et al., 2004). In this paper, we adopt
the Hamming loss used in (Taskar et al., 2003):
$\Delta\ell_i(\mathbf{y}) = \sum_{j=1}^{L} \mathbb{I}(y_j \neq y_j^i)$, where $\mathbb{I}(\cdot)$
is an indicator function that equals one if the argument is true and zero otherwise. The
optimization problem P0 is intractable because the feasible space for $\mathbf{w}$,

$$\mathcal{F}_0 = \{\mathbf{w} : \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \geq \Delta\ell_i(\mathbf{y}) - \xi_i; \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i\},$$

is defined by $O(N|\mathcal{Y}|)$ constraints, and $|\mathcal{Y}|$ itself is exponential in the
size of the input $\mathbf{x}$. By exploiting sparse dependencies among individual labels $y_l$
in $\mathbf{y}$, as reflected in the specific design of the feature functions (e.g., based on
pairwise labeling potentials in a pairwise Markov network), and the convex duality of the
objective, efficient optimization algorithms based on cutting-plane (Tsochantaridis et al.,
2004) or message-passing (Taskar et al., 2003) have been proposed to obtain an approximate
optimum solution to P0. As described shortly, these algorithms can be directly employed as
subroutines in solving our proposed model.
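The margin constraints of P0 can be made concrete with a small sketch. For a fixed $\mathbf{w}$, the smallest slack $\xi_i$ that satisfies every constraint for example $i$ is $\max(0, \max_{\mathbf{y} \neq \mathbf{y}^i} [\Delta\ell_i(\mathbf{y}) - \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y})])$. The snippet below computes it by enumeration with the Hamming loss; the feature map, label set, and weights are hypothetical, and the exhaustive loop merely stands in for the cutting-plane or message-passing solvers cited above.

```python
from itertools import product

def features(x, y):
    """Hypothetical feature map: emission matches and agreeing neighbor pairs."""
    emit = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    trans = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [emit, trans]

def hamming(y_true, y):
    """Hamming loss: number of positions where y differs from the true labeling."""
    return sum(1 for a, b in zip(y_true, y) if a != b)

def min_slack(w, x, y_true, labels=(0, 1)):
    """Smallest xi_i >= 0 with w^T Delta f_i(y) >= Delta l_i(y) - xi_i for all y != y_true."""
    f_true = features(x, y_true)
    worst = 0.0
    for y in product(labels, repeat=len(x)):
        if y == y_true:
            continue
        delta_f = [a - b for a, b in zip(f_true, features(x, y))]
        margin = sum(wk * d for wk, d in zip(w, delta_f))
        worst = max(worst, hamming(y_true, y) - margin)
    return worst

x = (0, 0, 1, 0)
y_true = (0, 0, 1, 0)
print(min_slack([1.0, 0.6], x, y_true))  # some constraints are violated
print(min_slack([1.0, 0.0], x, y_true))  # all margin constraints hold
```

The loop visits every labeling, which is exactly the exponential blow-up that motivates the specialized solvers.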



3. Maximum Entropy Discrimination Markov Networks 

Instead of learning a point estimator of $\mathbf{w}$ as in M$^3$N, in this paper we take a
Bayesian-style approach and learn a distribution $p(\mathbf{w})$ in a max-margin manner. For
prediction, we employ a convex combination of all possible models
$F(\,\cdot\,; \mathbf{w}) \in \mathcal{H}$ based on $p(\mathbf{w})$, that is:

$$h_1(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \int p(\mathbf{w}) F(\mathbf{x}, \mathbf{y}; \mathbf{w}) \, d\mathbf{w}. \qquad (2)$$

Now, the open question underlying this averaging prediction rule is how we can devise
an appropriate loss function and constraints over $p(\mathbf{w})$, in a similar spirit as the
margin-based scheme over $\mathbf{w}$ in P0, that lead to an optimum estimate of
$p(\mathbf{w})$. In the sequel, we present Maximum Entropy Discrimination Markov Networks
(MaxEnDNet, or MEDN), a novel framework that facilitates the estimation of a Bayesian-style
regularized distribution of M$^3$Ns defined by $p(\mathbf{w})$. As we show below, this new
Bayesian-style max-margin learning formalism offers several advantages such as simultaneous
primal and dual sparsity, a PAC-Bayesian generalization guarantee, and estimation robustness.
Note that the MaxEnDNet is different from the traditional Bayesian methods for discriminative
structured prediction such as the Bayesian CRFs (Qi et al., 2005), where the likelihood
function is well defined. Here, our approach is "Bayesian-style" because it learns and uses a
"posterior" distribution of all predictive models instead of choosing one model according to
some criterion, but the learning algorithm is based not on Bayes' theorem but on a maximum
entropy principle that biases towards a posterior that makes fewer additional assumptions over
a given prior over the predictive models.
3.1 Structured Maximum Entropy Discrimination

Given a training set $\mathcal{D}$ of structured input-output pairs, analogous to the feasible
space $\mathcal{F}_0$ for the weight vector $\mathbf{w}$ in a standard M$^3$N (cf. problem P0),
we define the feasible subspace $\mathcal{F}_1$ for the weight distribution $p(\mathbf{w})$ by
a set of expected margin constraints:

$$\mathcal{F}_1 = \Big\{ p(\mathbf{w}) : \int p(\mathbf{w}) [\Delta F_i(\mathbf{y}; \mathbf{w}) - \Delta\ell_i(\mathbf{y})] \, d\mathbf{w} \geq -\xi_i, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i \Big\}.$$

We learn the optimum $p(\mathbf{w})$ from $\mathcal{F}_1$ based on a structured maximum entropy
discrimination principle generalized from (Jaakkola et al., 1999). Under this principle, the
optimum $p(\mathbf{w})$ corresponds to the distribution that minimizes its relative entropy
with respect to some chosen prior $p_0$, as measured by the Kullback-Leibler divergence between
$p$ and $p_0$: $KL(p \| p_0) = \langle \log(p/p_0) \rangle_p$, where $\langle \cdot \rangle_p$
denotes the expectation with respect to $p$. If $p_0$ is uniform, then minimizing this
KL-divergence is equivalent to maximizing the entropy $H(p) = -\langle \log p \rangle_p$. A
natural information theoretic interpretation of this formulation is that we favor a
distribution over the hypothesis class $\mathcal{H}$ that bears minimum assumptions among
all feasible distributions in $\mathcal{F}_1$. The prior $p_0$ is a regularizer that introduces
an appropriate bias, if necessary.
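The equivalence between minimizing the KL-divergence to a uniform prior and maximizing entropy can be checked numerically. The sketch below uses a discrete distribution over three outcomes as a stand-in for the continuous $p(\mathbf{w})$ (an assumption made only to keep the example small); for a uniform $p_0$ over $n$ outcomes, $KL(p \| p_0) = \log n - H(p)$, so the two objectives differ by a constant.

```python
import math

def kl(p, q):
    """Discrete KL divergence KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Discrete Shannon entropy H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]          # illustrative distribution
uniform = [1.0 / 3] * 3      # uniform prior p_0

# For a uniform prior, KL(p || p_0) = log(n) - H(p), so the gap below is ~0.
gap = kl(p, uniform) - (math.log(3) - entropy(p))
print(kl(p, uniform), entropy(p), gap)
```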

To accommodate non-separable cases in the discriminative prediction problem, instead of
minimizing the usual KL, we optimize the generalized entropy (Dudík et al., 2007; Lebanon and
Lafferty, 2001), or a regularized KL-divergence,
$KL(p(\mathbf{w}) \| p_0(\mathbf{w})) + U(\xi)$, where $U(\xi)$ is a closed
proper convex function over the slack variables. This term can be understood as an additional
"potential" in the maximum entropy principle. Putting everything together, we can
now state a general formalism based on the following Maximum Entropy Discrimination
Markov Network framework:

Definition 1 (Maximum Entropy Discrimination Markov Networks) Given training data
$\mathcal{D} = \{(\mathbf{x}^i, \mathbf{y}^i)\}_{i=1}^N$, a chosen form of discriminant function
$F(\mathbf{x}, \mathbf{y}; \mathbf{w})$, a loss function $\Delta\ell_i(\mathbf{y})$, and an
ensuing feasible subspace $\mathcal{F}_1$ (defined above) for the parameter distribution
$p(\mathbf{w})$, the MaxEnDNet model that leads to a prediction function of the form of Eq. (2)
is defined by the following generalized relative entropy minimization with respect to a
parameter prior $p_0(\mathbf{w})$:

$$\text{P1 (MaxEnDNet)}: \quad \min_{p(\mathbf{w}), \xi} \ KL(p(\mathbf{w}) \| p_0(\mathbf{w})) + U(\xi)$$
$$\text{s.t. } p(\mathbf{w}) \in \mathcal{F}_1, \quad \xi_i \geq 0, \ \forall i.$$

The problem P1 defined above is a variational optimization problem over $p(\mathbf{w})$ in a
subspace of valid parameter distributions. Since both the KL and the function $U$ in P1 are
convex, and the constraints in $\mathcal{F}_1$ are linear, P1 is a convex program. In addition,
the expectations $\langle F(\mathbf{x}, \mathbf{y}; \mathbf{w}) \rangle_{p(\mathbf{w})}$ are
required to be bounded in order for $F$ to be a meaningful model. Thus, the problem P1
satisfies Slater's condition (see footnote 2) (Boyd and Vandenberghe, 2004, chap. 5),
which together with the convexity makes P1 enjoy nice properties, such as strong duality
and the existence of solutions. The problem P1 can be solved by applying the calculus of
variations to the Lagrangian to obtain a variational extremum, followed by a dual
transformation of P1. We state the main results below as a theorem, followed by a brief proof
that lends many insights into the solution to P1, which we will explore in subsequent analysis.

2. Since $\langle F(\mathbf{x}, \mathbf{y}; \mathbf{w}) \rangle_{p(\mathbf{w})}$ are bounded and $\xi_i \geq 0$, there always exists a $\xi$ that is large enough to make
the pair $(p(\mathbf{w}), \xi)$ satisfy Slater's condition.

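Theorem 2 (Solution to MaxEnDNet) The variational optimization problem P1 underlying
the MaxEnDNet gives rise to the following optimum distribution of Markov network
parameters $\mathbf{w}$:

$$p(\mathbf{w}) = \frac{1}{Z(\alpha)} \, p_0(\mathbf{w}) \exp\Big\{ \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) [\Delta F_i(\mathbf{y}; \mathbf{w}) - \Delta\ell_i(\mathbf{y})] \Big\}, \qquad (3)$$

where $Z(\alpha)$ is a normalization factor and the Lagrangian multipliers
$\alpha_i(\mathbf{y})$ (corresponding to the constraints in $\mathcal{F}_1$) can be obtained by
solving the dual problem of P1:

$$\text{D1}: \quad \max_{\alpha} \ -\log Z(\alpha) - U^*(\alpha)$$
$$\text{s.t. } \alpha_i(\mathbf{y}) \geq 0, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i,$$

where $U^*(\cdot)$ is the conjugate of the slack function $U(\cdot)$, i.e.,
$U^*(\alpha) = \sup_{\xi} \big( \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) \xi_i - U(\xi) \big)$.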

Proof (sketch) Since the problem P1 is a convex program and satisfies Slater's condition,
we can form a Lagrange function, whose saddle point gives the optimal solution of
P1 and D1, by introducing a non-negative dual variable $\alpha_i(\mathbf{y})$ for each
constraint in $\mathcal{F}_1$ and another non-negative dual variable $c$ for the normalization
constraint $\int p(\mathbf{w}) \, d\mathbf{w} = 1$. Details are deferred to Appendix B.1. ■



Since the problem P1 is a convex program and satisfies Slater's condition, the
saddle point of the Lagrange function is the KKT point of P1. From the KKT conditions
(Boyd and Vandenberghe, 2004, chap. 5), it can be shown that the above solution
enjoys dual sparsity, that is, only a few Lagrangian multipliers will be non-zero, which
correspond to the active constraints whose equality holds, analogous to the support vectors in
SVM. Thus MaxEnDNet enjoys a similar generalization property as the M$^3$N and SVM due
to the small "effective size" of the margin constraints. But it is important to realize
that this does not mean that the learned model is "primal-sparse", i.e., that only a few
elements in the weight vector $\mathbf{w}$ are non-zero. We will return to this point in
Section 4.

For a closed proper convex function $\phi(\mu)$, its conjugate is defined as
$\phi^*(\nu) = \sup_{\mu} [\nu^\top \mu - \phi(\mu)]$. In the problem D1, by convex duality
(Boyd and Vandenberghe, 2004), the log normalizer $\log Z(\alpha)$ can be shown to be the
conjugate of the KL-divergence. If the slack function is
$U(\xi) = C\|\xi\| = C \sum_i \xi_i$, it is easy to show that
$U^*(\alpha) = \mathbb{I}_\infty(\sum_{\mathbf{y}} \alpha_i(\mathbf{y}) \leq C, \ \forall i)$,
where $\mathbb{I}_\infty(\cdot)$ is a function that equals zero when its argument holds true
and infinity otherwise. Here, the strict inequality corresponds to the trivial solution
$\xi = 0$, that is, the training data are perfectly separable. Ignoring this inequality does
not affect the solution since the special case $\xi = 0$ is still included. Thus, the
Lagrangian multipliers $\alpha_i(\mathbf{y})$ in the dual problem D1 comply with the set of
constraints $\sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C, \ \forall i$. Another example is
$U(\xi) = KL(p(\xi) \| p_0(\xi))$, obtained by introducing uncertainty on the slack variables
(Jaakkola et al., 1999). In this case, expectations with respect to $p(\xi)$ are taken on both
sides of all the constraints in $\mathcal{F}_1$. Taking the dual, the dual function of $U$ is
another log normalizer. More details can be found in (Jaakkola et al., 1999). Some other $U$
functions and their dual functions are studied in (Lebanon and Lafferty, 2001; Dudík et al.,
2007).



Unlike most extant structured discriminative models including the highly successful
M$^3$N, which rely on a point estimator of the parameters, the MaxEnDNet model derived
above gives an optimum parameter distribution, which is used to make predictions via the
rule (2). Indeed, as we will show shortly, the MaxEnDNet is strictly more general than the
M$^3$N and subsumes the latter as a special case. But more importantly, the MaxEnDNet in
its full generality offers a number of important advantages while retaining all the merits
of the M$^3$N. First, MaxEnDNet admits a prior that can be designed to introduce useful
regularization effects, such as a primal sparsity bias. Second, the MaxEnDNet prediction is
based on model averaging and therefore enjoys a desirable smoothing effect, with a uniform
convergence bound on generalization error. Third, MaxEnDNet offers a principled way
to incorporate hidden generative models underlying the structured predictions, but allows
the predictive model to be discriminatively trained based on partially labeled data. In the
sequel, we analyze the first two points in detail; exploration of the third point is beyond
the scope of this paper, and can be found in (Zhu et al., 2008), where a partially observed
MaxEnDNet (PoMEN) is developed, which combines (possibly latent) generative models
and discriminative training for structured prediction.



3.2 Gaussian MaxEnDNet 

As Eq. (3) suggests, different choices of the parameter prior can lead to different
MaxEnDNet models for the predictive parameter distribution. In this subsection and the
following one, we explore a few common choices, e.g., Gaussian and Laplace priors.

We first show that, when the parameter prior is set to be a standard normal, MaxEnDNet
leads to a predictor that is identical to that of the M$^3$N. This somewhat surprising reduction
offers an important insight for understanding the property of MaxEnDNet. Indeed this
result should not be totally unexpected given the striking isomorphisms of the optimization
problem P1, the feasible space $\mathcal{F}_1$, and the predictive function $h_1$ underlying a
MaxEnDNet, to their counterparts P0, $\mathcal{F}_0$, and $h_0$, respectively, underlying an
M$^3$N. The following theorem makes our claim explicit.

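Theorem 3 (Gaussian MaxEnDNet: Reduction of MEDN to M$^3$N) Assuming
$F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$,
$U(\xi) = C \sum_i \xi_i$, and $p_0(\mathbf{w}) = \mathcal{N}(\mathbf{w} | 0, I)$, where $I$
denotes an identity matrix, then the posterior distribution is
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mu, I)$, where
$\mu = \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) \Delta\mathbf{f}_i(\mathbf{y})$,
and the Lagrangian multipliers $\alpha_i(\mathbf{y})$ in $p(\mathbf{w})$ are obtained by solving
the following dual problem, which is isomorphic to the dual form of the M$^3$N:

$$\max_{\alpha} \ \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) \Delta\ell_i(\mathbf{y}) - \frac{1}{2} \Big\| \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) \Delta\mathbf{f}_i(\mathbf{y}) \Big\|^2$$
$$\text{s.t. } \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) = C; \quad \alpha_i(\mathbf{y}) \geq 0, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i,$$

where $\Delta\mathbf{f}_i(\mathbf{y}) = \mathbf{f}(\mathbf{x}^i, \mathbf{y}^i) - \mathbf{f}(\mathbf{x}^i, \mathbf{y})$
as in P0. When applied to $h_1$, $p(\mathbf{w})$ leads to a predictive function that is
identical to $h_0(\mathbf{x}; \mathbf{w})$ given by Eq. (1).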
Proof See Appendix B.2 for details.



The above theorem is stated in the duality form. We can also show the following 
equivalence in the primal form. 

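Corollary 4 Under the same assumptions as in Theorem 3, the mean $\mu$ of the posterior
distribution $p(\mathbf{w})$ under a Gaussian MaxEnDNet is obtained by solving the following
primal problem:

$$\min_{\mu, \xi} \ \frac{1}{2}\|\mu\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t. } \mu^\top \Delta\mathbf{f}_i(\mathbf{y}) \geq \Delta\ell_i(\mathbf{y}) - \xi_i; \quad \xi_i \geq 0, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i.$$

Proof See Appendix B.3 for details. ■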

Theorem 3 and Corollary 4 both show that in the supervised learning setting, the M$^3$N
is a special case of MaxEnDNet when the slack function is linear and the parameter prior is
a standard normal. As we shall see later, this connection renders many existing techniques
for solving the M$^3$N directly applicable for solving the MaxEnDNet.
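The reduction can be illustrated numerically. For a linear discriminant function and a posterior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mu, I)$, the averaged score $\int p(\mathbf{w}) \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y}) \, d\mathbf{w}$ equals $\mu^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$, so the averaging rule and the point-estimate rule select the same labeling. The sketch below approximates the integral by Monte Carlo sampling; the feature map, the value of $\mu$, and the sample size are hypothetical choices for illustration, and $\mu$ is not obtained by solving the dual.

```python
import random
from itertools import product

def features(x, y):
    """Hypothetical feature map: emission matches and agreeing neighbor pairs."""
    emit = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    trans = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [emit, trans]

def averaged_scores(mu, x, labels=(0, 1), n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{w ~ N(mu, I)}[w^T f(x, y)] for every labeling y."""
    rng = random.Random(seed)
    candidates = list(product(labels, repeat=len(x)))
    feats = [features(x, y) for y in candidates]
    totals = [0.0] * len(candidates)
    for _ in range(n_samples):
        w = [rng.gauss(m, 1.0) for m in mu]  # one draw from N(mu, I)
        for j, f in enumerate(feats):
            totals[j] += sum(wk * fk for wk, fk in zip(w, f))
    return {y: t / n_samples for y, t in zip(candidates, totals)}

mu = [1.0, 0.6]   # illustrative posterior mean
x = (0, 0, 1, 0)
avg = averaged_scores(mu, x)

y_avg = max(avg, key=avg.get)  # averaging prediction rule
y_point = max(avg, key=lambda y: sum(m * f for m, f in zip(mu, features(x, y))))
print(y_avg, y_point)
```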



3.3 Laplace MaxEnDNet 

Recent trends in pursuing "sparse" graphical models have led to the emergence of
regularized versions of CRFs (Andrew and Gao, 2007) and Markov networks (Lee et al., 2006;
Wainwright et al., 2006). Interestingly, while such extensions have been successfully
implemented by several authors in maximum likelihood learning of various sparse graphical
models, they have not yet been explored in the context of maximum margin learning. Such
a gap is not merely due to negligence. Indeed, learning a sparse M$^3$N can be significantly
harder, as we discuss below.



One possible way to learn a sparse M$^3$N is to adopt the strategy of $L_1$-SVM (Bennett and
Mangasarian, 1992; Zhu et al., 2004) and directly use an $L_1$-norm instead of the $L_2$-norm
of $\mathbf{w}$ in the loss function (see Appendix A for a detailed description of this
formulation and the duality derivation). However, the primal problem of an $L_1$-regularized
M$^3$N is not directly solvable by re-formulating it as an LP problem due to the exponential
number of constraints; solving the dual problem, which now has only a polynomial number of
constraints as in the dual of M$^3$N, is still non-trivial due to the complicated form of the
constraints as shown in Appendix A. Constraint generation methods are possible. However,
although such methods (Tsochantaridis et al., 2004) have been shown to be efficient for
solving the QP problem in the standard M$^3$N, our preliminary empirical results show that
such a scheme with an LP solver for the $L_1$-regularized M$^3$N can be extremely expensive for
a non-trivial real data set. Another possible solution is the gradient descent methods
(Ratliff et al., 2007) with a projection onto the $L_1$-ball (Duchi et al., 2008).



The MaxEnDNet interpretation of the M$^3$N offers an alternative strategy that resembles
Bayesian regularization (Tipping, 2001; Kaban, 2007) in maximum likelihood estimation,
where shrinkage effects can be introduced by appropriate priors over the model parameters.
As Theorem 3 reveals, an M$^3$N corresponds to a Gaussian MaxEnDNet that admits a
standard normal prior for the weight vector $\mathbf{w}$. According to the standard Bayesian
regularization theory, to achieve a sparse estimate of a model, in the posterior distribution
of the feature weights, the weights of irrelevant features should peak around zero with very
small variances. However, the isotropy of the variances in all dimensions of the feature space
under a standard normal prior makes it infeasible for the resulting M$^3$N to adjust the
variances in different dimensions to fit a sparse model. Alternatively, we now employ a Laplace
prior for $\mathbf{w}$ to learn a Laplace MaxEnDNet. We show in the sequel that the parameter
posterior $p(\mathbf{w})$ under a Laplace MaxEnDNet has a shrinkage effect on small weights,
which is similar to directly applying an $L_1$-regularizer on an M$^3$N. Although exact
learning of a Laplace MaxEnDNet is also intractable, we show that this model can be efficiently
approximated by a variational inference procedure based on existing methods.

The Laplace prior of $\mathbf{w}$ is expressed as
$p_0(\mathbf{w}) = \prod_{k=1}^{K} \frac{\sqrt{\lambda}}{2} e^{-\sqrt{\lambda}|w_k|} = \big(\frac{\sqrt{\lambda}}{2}\big)^K e^{-\sqrt{\lambda}\|\mathbf{w}\|}$.
This density function is heavy tailed and peaked at zero; thus, it encodes a prior belief that
the distribution of $\mathbf{w}$ is strongly peaked around zero. Another nice property of the
Laplace density is that it is log-concave, i.e., its negative logarithm is convex, which can be
exploited to obtain a convex estimation problem analogous to LASSO (Tibshirani, 1996).



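Theorem 5 (Laplace MaxEnDNet: a sparse M$^3$N) Assuming
$F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$,
$U(\xi) = C \sum_i \xi_i$, and
$p_0(\mathbf{w}) = \prod_{k=1}^{K} \frac{\sqrt{\lambda}}{2} e^{-\sqrt{\lambda}|w_k|} = \big(\frac{\sqrt{\lambda}}{2}\big)^K e^{-\sqrt{\lambda}\|\mathbf{w}\|}$,
then the Lagrangian multipliers $\alpha_i(\mathbf{y})$ in $p(\mathbf{w})$ (as defined in
Theorem 2) are obtained by solving the following dual problem:

$$\max_{\alpha} \ \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) \Delta\ell_i(\mathbf{y}) - \sum_{k=1}^{K} \log \frac{\lambda}{\lambda - \eta_k^2}$$
$$\text{s.t. } \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) = C; \quad \alpha_i(\mathbf{y}) \geq 0, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i,$$

where $\eta_k = \sum_{i, \mathbf{y} \neq \mathbf{y}^i} \alpha_i(\mathbf{y}) \Delta f_i^k(\mathbf{y})$,
and $\Delta f_i^k(\mathbf{y}) = f_k(\mathbf{x}^i, \mathbf{y}^i) - f_k(\mathbf{x}^i, \mathbf{y})$
represents the $k$th component of $\Delta\mathbf{f}_i(\mathbf{y})$. Furthermore, the
constraints $\eta_k^2 < \lambda, \ \forall k$, must be satisfied.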

Since several intermediate results from the proof of this theorem will be used in subsequent
presentations, we provide the complete proof below. Our proof is based on a hierarchical
representation of the Laplace prior. As noted in (Figueiredo, 2003), the Laplace distribution
$p(w) = \frac{\sqrt{\lambda}}{2} e^{-\sqrt{\lambda}|w|}$ is equivalent to a two-layer hierarchical Gaussian-exponential
model, where $w$ follows a zero-mean Gaussian distribution $p(w|\tau) = \mathcal{N}(w|0, \tau)$ and the
variance $\tau$ admits an exponential hyper-prior density,

$$
p(\tau|\lambda) = \frac{\lambda}{2} \exp\left\{-\frac{\lambda}{2}\tau\right\}, \quad \text{for } \tau \ge 0.
$$

This alternative form straightforwardly leads to the following new representation of our
multivariate Laplace prior for the parameter vector $w$ in MaxEnDNet:

$$
p_0(w) = \prod_{k=1}^{K} p_0(w_k) = \prod_{k=1}^{K} \int p(w_k|\tau_k)\, p(\tau_k|\lambda)\, d\tau_k = \int p(w|\tau)\, p(\tau|\lambda)\, d\tau, \qquad (4)
$$

where $p(w|\tau) = \prod_{k=1}^{K} p(w_k|\tau_k)$ and $p(\tau|\lambda) = \prod_{k=1}^{K} p(\tau_k|\lambda)$ represent multivariate Gaussian
and exponential, respectively, and $d\tau = d\tau_1 \cdots d\tau_K$.
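The one-dimensional version of the hierarchical representation in Eq. (4) can be checked numerically. The following is a minimal sketch and not part of the original derivation: the function names, the substitution $\tau = s^2$ (used only to remove the integrable singularity at $\tau = 0$), the truncation point `s_max`, and the grid size `n` are illustrative assumptions.

```python
import math

def laplace_density(w, lam):
    """Univariate Laplace prior (sqrt(lam)/2) * exp(-sqrt(lam) * |w|)."""
    return 0.5 * math.sqrt(lam) * math.exp(-math.sqrt(lam) * abs(w))

def scale_mixture_density(w, lam, s_max=12.0, n=20000):
    """Midpoint-rule estimate of int_0^inf N(w|0,tau) (lam/2) exp(-lam*tau/2) dtau.

    Assumption: with tau = s**2 the integrand becomes
    lam/sqrt(2*pi) * exp(-w**2/(2*s**2) - lam*s**2/2), which is smooth on (0, s_max].
    The truncation at s_max is only adequate when lam is not very small.
    """
    h = s_max / n
    total = 0.0
    for j in range(n):
        s = (j + 0.5) * h
        total += math.exp(-w * w / (2.0 * s * s) - 0.5 * lam * s * s)
    return lam / math.sqrt(2.0 * math.pi) * total * h

# Compare the two densities at a few points for lam = 4.
for w in (0.0, 0.5, -1.5):
    print(w, laplace_density(w, 4.0), scale_mixture_density(w, 4.0))
```

Under these assumptions the two columns should agree to several decimal places, which is the content of the Gaussian-exponential mixture identity.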

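Proof (of Theorem 5) Substitute the hierarchical representation of the Laplace prior (Eq.
(4)) into $p(w)$ in Theorem 2, and we get the normalization factor $Z(\alpha)$ as follows,

$$
\begin{aligned}
Z(\alpha) &= \int \int p(w|\tau)\, p(\tau|\lambda)\, d\tau \cdot \exp\Big\{w^\top \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\}\, dw \\
&= \int p(\tau|\lambda) \int p(w|\tau) \cdot \exp\Big\{w^\top \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\}\, dw\, d\tau \\
&= \int p(\tau|\lambda) \int \mathcal{N}(w|0, A) \exp\Big\{w^\top \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\}\, dw\, d\tau \\
&= \int p(\tau|\lambda) \exp\Big\{\tfrac{1}{2}\eta^\top A \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\}\, d\tau \\
&= \exp\Big\{-\sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\} \prod_{k=1}^{K} \int \frac{\lambda}{2} \exp\Big(-\frac{\lambda}{2}\tau_k\Big) \exp\Big(\frac{1}{2}\eta_k^2 \tau_k\Big)\, d\tau_k \\
&= \exp\Big\{-\sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\} \prod_{k=1}^{K} \frac{\lambda}{\lambda - \eta_k^2}, \qquad (5)
\end{aligned}
$$

where $A = \mathrm{diag}(\tau_k)$ is a diagonal matrix and $\eta$ is a column vector with $\eta_k$ defined as in
Theorem 5. The last equality is due to the moment generating function of an exponential
distribution. The constraint $\eta_k^2 < \lambda$, $\forall k$, is needed in this derivation to keep the
integral from diverging. Substituting the normalization factor derived above into the general
dual problem D1 in Theorem 2 and using the same argument of the convex conjugate of
$U(\xi) = C \sum_i \xi_i$ as in Theorem 3, we arrive at the dual problem in Theorem 5. ■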
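The dual objective that results from Eq. (5) can be evaluated directly once the dual variables are given. The sketch below is illustrative only: the flat indexing of the pairs $(i, y \neq y^i)$, the function name, and the convention of returning $-\infty$ outside the domain $\eta_k^2 < \lambda$ are assumptions, and the equality constraints on $\alpha$ are not checked.

```python
import math

def laplace_dual_objective(alpha, delta_f, delta_loss, lam):
    """Objective of the Laplace MaxEnDNet dual for a given alpha.

    alpha[j], delta_loss[j]: alpha_i(y) and Delta l_i(y) for the j-th pair (i, y != y^i).
    delta_f[j]: list of K floats holding Delta f_i(y) for the same pair.
    """
    K = len(delta_f[0])
    # eta_k = sum over pairs of alpha_i(y) * Delta f_i^k(y)
    eta = [sum(a * df[k] for a, df in zip(alpha, delta_f)) for k in range(K)]
    if any(e * e >= lam for e in eta):
        return float("-inf")  # assumed convention: infeasible since eta_k^2 >= lam
    linear = sum(a * l for a, l in zip(alpha, delta_loss))
    penalty = sum(math.log(lam / (lam - e * e)) for e in eta)
    return linear - penalty

# Two constraints, two features, lam = 4 (all values are made up for illustration).
value = laplace_dual_objective([0.5, 0.5], [[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0], 4.0)
print(value)
```

The logarithmic penalty couples every dual variable through $\eta_k$, which is the reason given below for why QP solvers designed for the M$^3$N dual do not apply directly.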

It can be shown that the dual objective function of Laplace MaxEnDNet in Theorem 5
is concave (see footnote 3). But since each $\eta_k$ depends on all the dual variables $\alpha$ and $\eta_k^2$ appears
within a logarithm, the optimization problem underlying Laplace MaxEnDNet would be very
difficult to solve. The SMO (Taskar et al., 2003) and the exponentiated gradient methods
(Bartlett et al., 2004) developed for the QP dual problem of M$^3$N cannot be easily applied
here. Thus, we will turn to a variational approximation method, as shown in Section 5. For
completeness, we end this section with a corollary similar to Corollary 4, which states
the primal optimization problem underlying the MaxEnDNet with a Laplace prior. As we
shall see, the primal optimization problem in this case is complicated and provides another
perspective on the hardness of solving the Laplace MaxEnDNet.

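Corollary 6 Under the same assumptions as in Theorem 5, the mean $\mu$ of the posterior
distribution $p(w)$ under a Laplace MaxEnDNet is obtained by solving the following primal
problem:

$$
\begin{aligned}
\min_{\mu,\, \xi}\quad & \sqrt{\lambda} \sum_{k=1}^{K} \left( \sqrt{\mu_k^2 + \frac{1}{\lambda}} - \frac{1}{\sqrt{\lambda}} \log \frac{\sqrt{\lambda \mu_k^2 + 1} + 1}{2} \right) + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.}\quad & \mu^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i;\quad \xi_i \ge 0,\ \forall i,\ \forall y \neq y^i.
\end{aligned}
$$

Proof The proof requires the result of Corollary 4. We defer it to Appendix B.4. ■

3. $\eta_k^2$ is convex over $\alpha$ because it is the composition of $f(x) = x^2$ with an affine mapping. So, $\lambda - \eta_k^2$ is
concave and $\log(\lambda - \eta_k^2)$ is also concave due to the composition rule (Boyd and Vandenberghe, 2004).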

Since the "norm" (see footnote 4)

$$
\sum_{k=1}^{K} \left( \sqrt{\mu_k^2 + \frac{1}{\lambda}} - \frac{1}{\sqrt{\lambda}} \log \frac{\sqrt{\lambda \mu_k^2 + 1} + 1}{2} \right) = \|\mu\|_{KL}
$$

corresponds to the KL-divergence between $p(w)$ and $p_0(w)$ under a Laplace MaxEnDNet,
we will refer to it as a KL-norm and denote it by $\|\cdot\|_{KL}$ in the sequel. This KL-norm is
different from the $L_2$-norm as used in M$^3$N, but is closely related to the $L_1$-norm, which
encourages a sparse estimator. In the following section, we provide a detailed analysis of
the sparsity of Laplace MaxEnDNet resulting from the regularization effect of this norm.

4. Entropic Regularization and Sparse M$^3$N

Compared to the structured prediction law $h_0$ due to an M$^3$N, which enjoys dual sparsity
(i.e., few support vectors), the $h_1$ defined by a Laplace MaxEnDNet is not only dual sparse
but also primal sparse; that is, features that are insignificant will experience strong shrinkage
on their corresponding weight $w_k$.

The primal sparsity of $h_1$ achieved by the Laplace MaxEnDNet is due to a shrinkage
effect resulting from the Laplacian entropic regularization. In this section, we take a close
look at this regularization effect, in comparison with other common regularizers, such as
the $L_2$-norm in M$^3$N (which is equivalent to the Gaussian MaxEnDNet), and the $L_1$-norm
that at least in principle could be directly applied to M$^3$N. Since our main interest here is
the sparsity of the structured prediction law $h_1$, we examine the posterior mean under $p(w)$
via exact integration. It can be shown that under a Laplace MaxEnDNet, $p(w)$ exhibits
the following posterior shrinkage effect.

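Corollary 7 (Entropic Shrinkage) The posterior mean of the Laplace MaxEnDNet has
the following form:

$$
\langle w_k \rangle_p = \frac{2\eta_k}{\lambda - \eta_k^2}, \quad \forall 1 \le k \le K, \qquad (6)
$$

where $\eta_k = \sum_{i,\, y \neq y^i} \alpha_i(y)\,(f_k(x^i, y^i) - f_k(x^i, y))$ and $\eta_k^2 < \lambda$, $\forall k$.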

4. This is not exactly a norm because positive scalability does not hold. But the KL-norm is
non-negative due to the non-negativity of KL-divergence. In fact, by using the inequality $e^x \ge 1 + x$, we can
show that each component $\left(\sqrt{\mu_k^2 + \frac{1}{\lambda}} - \frac{1}{\sqrt{\lambda}} \log \frac{\sqrt{\lambda \mu_k^2 + 1} + 1}{2}\right)$ is monotonically increasing with respect to
$\mu_k^2$ and $\|\mu\|_{KL} \ge K/\sqrt{\lambda}$, where the equality holds only when $\mu = 0$. Thus, $\|\mu\|_{KL}$ penalizes large weights.
For convenient comparison with the popular $L_2$ and $L_1$ norms, we call it a KL-norm.

Figure 1: Posterior means with different priors against their corresponding $\eta = \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta f_i(y)$.
Note that the $\eta$ for different priors are generally different because of the different dual parameters.



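Proof Using the integration result in Eq. (5), we can get:

$$
\frac{\partial \log Z}{\partial \alpha_i(y)} = v^\top \Delta f_i(y) - \Delta\ell_i(y), \qquad (7)
$$

where $v$ is a column vector and $v_k = \frac{2\eta_k}{\lambda - \eta_k^2}$, $\forall 1 \le k \le K$. An alternative way to compute the
derivatives is using the definition of $Z$: $Z = \int p_0(w) \cdot \exp\{w^\top \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\}\, dw$.
We can get:

$$
\frac{\partial \log Z}{\partial \alpha_i(y)} = \langle w \rangle_p^\top \Delta f_i(y) - \Delta\ell_i(y). \qquad (8)
$$

Comparing Eqs. (7) and (8), we get $\langle w \rangle_p = v$, that is, $\langle w_k \rangle_p = \frac{2\eta_k}{\lambda - \eta_k^2}$, $\forall 1 \le k \le K$. The
constraints $\eta_k^2 < \lambda$, $\forall k$, are required to get a finite normalization factor as shown in Eq. (5). ■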



A plot of the relationship between $\langle w_k \rangle_p$ under a Laplace MaxEnDNet and the corresponding
$\eta_k$ revealed by Corollary 7 is shown in Figure 1 (for example, the red curve), from
which we can see that the smaller $\eta_k$ is, the more shrinkage toward zero is imposed on
$\langle w_k \rangle_p$.

This entropic shrinkage effect on $w$ is not present in the standard M$^3$N or the Gaussian
MaxEnDNet. Recall that by definition, the vector $\eta = \sum_{i,\, y} \alpha_i(y)\,\Delta f_i(y)$ is determined by
the dual parameters $\alpha_i(y)$ obtained by solving a model-specific dual problem. When the
$\alpha_i(y)$'s are obtained by solving the dual of the standard M$^3$N, it can be shown that the
optimum point solution of the parameters is $w^\star = \eta$. When the $\alpha_i(y)$'s are obtained from
the dual of the Gaussian MaxEnDNet, Theorem 3 shows that the posterior mean of the
parameters is $\langle w \rangle_{p_{\text{Gaussian}}} = \eta$. (As we have already pointed out, since these two dual problems
are isomorphic, the $\alpha_i(y)$'s for M$^3$N and Gaussian MaxEnDNet are identical, hence the
resulting $\eta$'s are the same.) In both cases, there is no shrinkage along any particular
dimension of the parameter vector $w$ or of the mean vector of $p(w)$. Therefore, although
both M$^3$N and Gaussian MaxEnDNet enjoy dual sparsity, because the KKT conditions
imply that most of the dual parameters $\alpha_i(y)$ are zero, $w^\star$ and $\langle w \rangle_{p_{\text{Gaussian}}}$ are not primal
sparse. From Eq. (6), we can conclude that the Laplace MaxEnDNet is also dual sparse,
because its mean $\langle w \rangle_{p_{\text{Laplace}}}$ can be uniquely determined by $\eta$. But the shrinkage effect on
different components of the $\langle w \rangle_{p_{\text{Laplace}}}$ vector causes $\langle w \rangle_{p_{\text{Laplace}}}$ to be also primal sparse.

A comparison of the posterior mean estimates of $w$ under MaxEnDNet with three different
priors versus their associated $\eta$ is shown in Figure 1. The three priors in question
are a standard normal, a Laplace with $\lambda = 4$, and a Laplace with $\lambda = 6$. It can be seen
that, under the entropic regularization with a Laplace prior, $\langle w \rangle_p$ gets shrunk toward
zero when $\eta$ is small. The larger the $\lambda$ value is, the greater the shrinkage effect. For a fixed
$\lambda$, the shape of the shrinkage curve (i.e., the $\langle w \rangle_p$ versus $\eta$ curve) is smoothly nonlinear, but no
component is explicitly discarded, that is, no weight is set explicitly to zero. In contrast,
for the Gaussian MaxEnDNet, which is equivalent to the standard M$^3$N, there is no such
shrinkage effect.
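The shrinkage curves described above follow directly from Eq. (6). The sketch below tabulates them for the two Laplace priors mentioned in the text; the function names and the sample values of $\eta_k$ are illustrative assumptions.

```python
def laplace_posterior_mean(eta_k, lam):
    """Posterior mean 2*eta_k / (lam - eta_k**2) from Eq. (6); needs eta_k**2 < lam."""
    if eta_k * eta_k >= lam:
        raise ValueError("requires eta_k**2 < lam")
    return 2.0 * eta_k / (lam - eta_k * eta_k)

def gaussian_posterior_mean(eta_k):
    """Under the standard normal prior the posterior mean equals eta_k (no shrinkage)."""
    return eta_k

# Columns: eta_k, Gaussian mean, Laplace mean (lam = 4), Laplace mean (lam = 6).
for eta_k in (0.1, 0.5, 1.0):
    print(eta_k,
          gaussian_posterior_mean(eta_k),
          laplace_posterior_mean(eta_k, 4.0),
          laplace_posterior_mean(eta_k, 6.0))
```

For positive $\eta_k$, the Laplace mean falls below $\eta_k$ exactly when $\eta_k^2 < \lambda - 2$, so the shrinkage regime is the small-$\eta_k$ end of the curve, and it widens as $\lambda$ grows.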

Corollary 6 offers another perspective on how the Laplace MaxEnDNet relates to the
$L_1$-norm M$^3$N, which yields a sparse estimator. Note that as $\lambda$ goes to infinity, the
KL-norm approaches $\|\mu\|_1$, i.e., the $L_1$-norm (see footnote 5). This means that the MaxEnDNet with
a Laplace prior will be (nearly) the same as the $L_1$-M$^3$N if the regularization constant $\lambda$ is
large enough.
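The limiting behaviour of the KL-norm can be illustrated numerically. The following sketch evaluates $\|\mu\|_{KL}$ for a fixed vector and increasing $\lambda$; the function name and the test vector are assumptions made for illustration.

```python
import math

def kl_norm(mu, lam):
    """KL-norm: sum_k sqrt(mu_k^2 + 1/lam) - (1/sqrt(lam)) * log((sqrt(lam*mu_k^2 + 1) + 1) / 2)."""
    total = 0.0
    for m in mu:
        root = math.sqrt(m * m + 1.0 / lam)
        log_term = math.log((math.sqrt(lam * m * m + 1.0) + 1.0) / 2.0)
        total += root - log_term / math.sqrt(lam)
    return total

mu = [0.5, -2.0]
l1 = sum(abs(m) for m in mu)
for lam in (4.0, 100.0, 1e4, 1e8):
    print(lam, kl_norm(mu, lam), l1)
```

The gap to the $L_1$-norm closes slowly, at a rate governed by $\log(\sqrt{\lambda})/\sqrt{\lambda}$, which is consistent with the "(nearly) the same" qualification in the text.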

A more explicit illustration of the entropic regularization under a Laplace MaxEnDNet,
compared to the conventional $L_1$ and $L_2$ regularization over an M$^3$N, can be seen
in Figure 2, where the feasible regions due to the three different norms used in the
regularizer are plotted in a two-dimensional space. Specifically, it shows (1) $L_2$-norm:
$w_1^2 + w_2^2 \le 1$; (2) $L_1$-norm: $|w_1| + |w_2| \le 1$; and (3) KL-norm (see footnote 6):
$\sqrt{w_1^2 + 1/\lambda} + \sqrt{w_2^2 + 1/\lambda} - (1/\sqrt{\lambda}) \log(\sqrt{\lambda w_1^2 + 1}/2 + 1/2) - (1/\sqrt{\lambda}) \log(\sqrt{\lambda w_2^2 + 1}/2 + 1/2) \le b$,
where $b$ is a parameter to make the boundary pass the $(0, 1)$ point for easy comparison with
the $L_2$ and $L_1$ curves. It is easy to show that $b$ equals
$\sqrt{1/\lambda} + \sqrt{1 + 1/\lambda} - (1/\sqrt{\lambda}) \log(\sqrt{\lambda + 1}/2 + 1/2)$.
It can be seen that the $L_1$-norm boundary has sharp turning points when it passes the
axes, whereas the $L_2$ and KL-norm boundaries turn smoothly at those points. This is
the intuitive explanation of why the $L_1$-norm directly gives sparse estimators, whereas the
$L_2$-norm and KL-norm due to a Laplace prior do not. But as shown in Figure 2(b), when
$\lambda$ gets larger and larger, the KL-norm boundary moves closer and closer to the $L_1$-norm
boundary. When $\lambda \to \infty$,
$\sqrt{w_1^2 + 1/\lambda} + \sqrt{w_2^2 + 1/\lambda} - (1/\sqrt{\lambda}) \log(\sqrt{\lambda w_1^2 + 1}/2 + 1/2) - (1/\sqrt{\lambda}) \log(\sqrt{\lambda w_2^2 + 1}/2 + 1/2) \to |w_1| + |w_2|$
and $b \to 1$, which yields exactly the $L_1$-norm in the two-dimensional space. Thus, under the
linear model assumption of the discriminant functions $F(\,\cdot\,; w)$, our framework can be seen
as a smooth relaxation of the $L_1$-M$^3$N.



5. As $\lambda \to \infty$, the logarithm terms in $\|\mu\|_{KL}$ disappear because of the fact that $\frac{\log x}{x} \to 0$ when $x \to \infty$.

6. The curves are drawn with a symbolic computational package to solve an equation of the form $2x - \log x = a$,
where $x$ is the variable to be solved and $a$ is a constant.

Figure 2: (a) $L_2$-norm (solid line) and $L_1$-norm (dashed line); (b) KL-norm with different Laplace priors.

5. Variational Learning of Laplace MaxEnDNet

Although Theorem 2 seems to offer a general closed-form solution to $p(w)$ under an arbitrary
prior $p_0(w)$, in practice the Lagrangian parameters $\alpha_i(y)$ in $p(w)$ can be very hard to
estimate from the dual problem D1 except for a few special choices of $p_0(w)$, such as a
normal as shown in Theorem 3, which can be easily generalized to any normal prior. When
$p_0(w)$ is a Laplace prior, as we have shown in Theorem 5 and Corollary 6, the corresponding
dual problem or primal problem involves a complex objective function that is difficult to
optimize. Here, we present a variational method for approximate learning of the Laplace
MaxEnDNet.

Our approach is built on the hierarchical interpretation of the Laplace prior as shown
in Eq. (4). Replacing the $p_0(w)$ in Problem P1 with Eq. (4) and applying Jensen's
inequality, we get an upper bound of the KL-divergence:

$$
\begin{aligned}
KL(p \| p_0) &= -H(p) - \Big\langle \log \int p(w|\tau)\, p(\tau|\lambda)\, d\tau \Big\rangle_p \\
&\le -H(p) - \Big\langle \int q(\tau) \log \frac{p(w|\tau)\, p(\tau|\lambda)}{q(\tau)}\, d\tau \Big\rangle_p \triangleq \mathcal{L}(p(w), q(\tau)),
\end{aligned}
$$

where $q(\tau)$ is a variational distribution used to approximate $p(\tau|\lambda)$. The upper bound is
in fact a KL-divergence: $\mathcal{L}(p(w), q(\tau)) = KL(p(w)\,q(\tau)\,\|\,p(w|\tau)\,p(\tau|\lambda))$. Thus, $\mathcal{L}$ is convex
over $p(w)$ and over $q(\tau)$ separately, but not necessarily jointly convex over $(p(w), q(\tau))$.

Algorithm 1 Variational MaxEnDNet

Input: data $\mathcal{D} = \{(x^i, y^i)\}_{i=1}^{N}$, constants $C$ and $\lambda$, iteration number $T$
Output: posterior mean $\langle w \rangle_p^{T}$
Initialize $\langle w \rangle_p^{1} \leftarrow 0$, $\Sigma^{1} \leftarrow I$
for $t = 1$ to $T - 1$ do
    Step 1: solve (10) or (11) for $\langle w \rangle_p^{t+1} = \Sigma^{t} \eta$; update $\langle w w^\top \rangle_p^{t+1} \leftarrow \Sigma^{t} + \langle w \rangle_p^{t+1} (\langle w \rangle_p^{t+1})^\top$.
    Step 2: use (12) to update $\Sigma^{t+1} \leftarrow \mathrm{diag}\Big(\sqrt{\langle w_k^2 \rangle_p^{t+1} / \lambda}\Big)$.
end for



Substituting this upper bound for the KL-divergence in P1, we now solve the following
Variational MaxEnDNet problem,

$$
\text{P1}'\ (\text{vMEDN}): \quad \min_{p(w) \in \mathcal{F}_1;\ q(\tau);\ \xi}\ \mathcal{L}(p(w), q(\tau)) + U(\xi). \qquad (9)
$$

P1$'$ can be solved with an iterative minimization algorithm alternating between
optimizing over $(p(w), \xi)$ and $q(\tau)$, as outlined in Algorithm 1 and detailed below.

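Step 1: Keep $q(\tau)$ fixed, optimize P1$'$ with respect to $(p(w), \xi)$. Using the same
procedure as in solving P1, we get the posterior distribution $p(w)$ as follows,

$$
\begin{aligned}
p(w) &\propto \exp\Big\{\int q(\tau) \log p(w|\tau)\, d\tau - b\Big\} \cdot \exp\Big\{w^\top \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\} \\
&\propto \exp\Big\{-\tfrac{1}{2} w^\top \langle A^{-1} \rangle_q w - b + w^\top \eta - \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y)\Big\} \\
&= \mathcal{N}(w|\mu, \Sigma),
\end{aligned}
$$

where $\eta = \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta f_i(y)$, $A = \mathrm{diag}(\tau_k)$, and $b = KL(q(\tau)\,\|\,p(\tau|\lambda))$ is a constant.
The posterior mean and variance are $\langle w \rangle_p = \mu = \Sigma\eta$ and $\Sigma = (\langle A^{-1} \rangle_q)^{-1} = \langle w w^\top \rangle_p - \langle w \rangle_p \langle w \rangle_p^\top$,
respectively. Note that this posterior distribution is also a normal distribution.
Analogous to the proof of Theorem 3, we can derive that the dual parameters $\alpha$ are estimated
by solving the following dual problem:

$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{i,\, y \neq y^i} \alpha_i(y)\,\Delta\ell_i(y) - \frac{1}{2} \eta^\top \Sigma \eta \qquad (10) \\
\text{s.t.}\quad & \sum_{y \neq y^i} \alpha_i(y) = C;\quad \alpha_i(y) \ge 0,\ \forall i,\ \forall y \neq y^i.
\end{aligned}
$$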

This dual problem is now a standard quadratic program symbolically identical to the
dual of an M$^3$N, and can be directly solved using existing algorithms developed for M$^3$N,
such as (Taskar et al., 2003; Bartlett et al., 2004). Alternatively, we can solve the following
primal problem:

$$
\begin{aligned}
\min_{w,\, \xi}\quad & \frac{1}{2} w^\top \Sigma^{-1} w + C \sum_{i=1}^{N} \xi_i \qquad (11) \\
\text{s.t.}\quad & w^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i;\quad \xi_i \ge 0,\ \forall i,\ \forall y \neq y^i.
\end{aligned}
$$

Based on the proof of Corollary 4, it is easy to show that the solution of the problem (11)
leads to the posterior mean of $w$ under $p(w)$, which will be used to do prediction by
$h_1$. The primal problem can be solved with the subgradient (Ratliff et al., 2007), cutting-plane
(Tsochantaridis et al., 2004), or extragradient (Taskar et al., 2006) method.



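Step 2: Keep $p(w)$ fixed, optimize P1$'$ with respect to $q(\tau)$. Taking the derivative of $\mathcal{L}$
with respect to $q(\tau)$ and setting it to zero, we get:

$$
q(\tau) \propto p(\tau|\lambda) \exp\{\langle \log p(w|\tau) \rangle_p\}.
$$

Since both $p(w|\tau)$ and $p(\tau|\lambda)$ can be written as a product of univariate Gaussian and
univariate exponential distributions, respectively, over each dimension, $q(\tau)$ also factorizes
over each dimension: $q(\tau) = \prod_{k=1}^{K} q(\tau_k)$, where each $q(\tau_k)$ can be expressed as:

$$
\begin{aligned}
\forall k:\quad q(\tau_k) &\propto p(\tau_k|\lambda) \exp\{\langle \log p(w_k|\tau_k) \rangle_p\} \\
&\propto \mathcal{N}\Big(\sqrt{\langle w_k^2 \rangle_p}\ \Big|\ 0, \tau_k\Big) \exp\Big(-\frac{1}{2}\lambda\tau_k\Big).
\end{aligned}
$$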


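The same distribution has been derived in (Kaban, 2007), and similar to the hierarchical
representation of a Laplace distribution we can get the normalization factor:
$\int \mathcal{N}\big(\sqrt{\langle w_k^2 \rangle_p}\,\big|\,0, \tau_k\big) \cdot \frac{\lambda}{2} \exp(-\frac{1}{2}\lambda\tau_k)\, d\tau_k = \frac{\sqrt{\lambda}}{2} \exp\big(-\sqrt{\lambda \langle w_k^2 \rangle_p}\big)$.
Also, as in (Kaban, 2007), we can calculate the expectations $\langle \tau_k^{-1} \rangle_q$, which are required
in calculating $\langle A^{-1} \rangle_q$, as follows,

$$
\Big\langle \frac{1}{\tau_k} \Big\rangle_q = \int \frac{1}{\tau_k}\, q(\tau_k)\, d\tau_k = \sqrt{\frac{\lambda}{\langle w_k^2 \rangle_p}}. \qquad (12)
$$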
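The closed form in Eq. (12) can be verified by numerical integration of the unnormalized $q(\tau_k)$. This is a minimal sketch under stated assumptions: the function name, the substitution $\tau = s^2$, the truncation point, and the grid size are illustrative choices, and the second moment must be strictly positive.

```python
import math

def expected_inverse_tau(second_moment, lam, s_max=12.0, n=50000):
    """Midpoint-rule estimate of <1/tau>_q for q(tau) ∝ tau^(-1/2) exp(-m/(2 tau) - lam*tau/2).

    Assumption: with tau = s**2, q(tau) dtau ∝ exp(-m/(2 s^2) - lam s^2 / 2) ds,
    so <1/tau> is the ratio of the integrals of g(s)/s^2 and g(s).
    """
    h = s_max / n
    numerator = 0.0
    denominator = 0.0
    for j in range(n):
        s = (j + 0.5) * h
        g = math.exp(-second_moment / (2.0 * s * s) - 0.5 * lam * s * s)
        numerator += g / (s * s)
        denominator += g
    return numerator / denominator

# Compare against sqrt(lam / <w_k^2>) for made-up values of <w_k^2> and lam.
for m, lam in ((0.5, 4.0), (2.0, 9.0)):
    print(m, lam, expected_inverse_tau(m, lam), math.sqrt(lam / m))
```

Inverting this expectation gives the diagonal entry $\sqrt{\langle w_k^2 \rangle_p / \lambda}$ used in Step 2 of Algorithm 1.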



We iterate between the above two steps until convergence. Due to the convexity (not
joint convexity) of the upper bound, the algorithm is guaranteed to converge to a local
optimum. Then, we apply the posterior distribution $p(w)$, which is in the form of a normal
distribution, to make predictions using the averaging prediction law in Eq. (2). Due to the
shrinkage effect of the Laplacian entropic regularization discussed in Section 4, for irrelevant
features, the variances should converge to zero and thus lead to a sparse estimate of $w$.
To summarize, the intuition behind this iterative minimization algorithm is as follows. First,
we use a Gaussian distribution to approximate the Laplace distribution and thus get a QP
problem that is analogous to that of the standard M$^3$N; then, in the second step we update
the covariance matrix in the QP problem with an exponential hyper-prior on the variance.
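The alternation between the two steps can be sketched on a toy problem. The code below is a hedged illustration, not the authors' implementation: it replaces structured outputs by a single class label, uses a 0/1 loss for $\Delta\ell$, keeps only the diagonal of $\Sigma$, and solves primal (11) approximately with a preconditioned subgradient method chosen here for simplicity. All function names, the toy data, and the constants are assumptions.

```python
import math
import random

def joint_feature(x, y, n_labels):
    """f(x, y): copy x into the block of label y (an assumed multiclass construction)."""
    f = [0.0] * (len(x) * n_labels)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def solve_step1(data, sigma, C, n_labels, iters=200, lr=0.1):
    """Approximate solution of primal (11) for a diagonal Sigma."""
    w = [0.0] * len(sigma)
    for _ in range(iters):
        g = [0.0] * len(sigma)  # sum of Delta f_i(y) over the most violated constraints
        for x, y in data:
            f_true = joint_feature(x, y, n_labels)
            worst, worst_df = 0.0, None
            for y_alt in range(n_labels):
                if y_alt == y:
                    continue
                f_alt = joint_feature(x, y_alt, n_labels)
                df = [a - b for a, b in zip(f_true, f_alt)]
                violation = 1.0 - sum(wk * d for wk, d in zip(w, df))  # 0/1 loss
                if violation > worst:
                    worst, worst_df = violation, df
            if worst_df is not None:
                g = [gk + d for gk, d in zip(g, worst_df)]
        # A subgradient of (11) is Sigma^{-1} w - C g; multiplying by Sigma gives w - C Sigma g.
        w = [(1.0 - lr) * wk + lr * C * sk * gk for wk, sk, gk in zip(w, sigma, g)]
    return w

def variational_medn(data, n_labels, C, lam, T):
    K = len(data[0][0]) * n_labels
    sigma = [1.0] * K  # diagonal of Sigma^1 = I
    mean = [0.0] * K   # <w>^1 = 0
    for _ in range(T - 1):
        mean = solve_step1(data, sigma, C, n_labels)        # Step 1: posterior mean
        second = [s + m * m for s, m in zip(sigma, mean)]   # diagonal of <w w^T>
        sigma = [math.sqrt(v / lam) for v in second]        # Step 2: Eq. (12) inverted
    return mean, sigma

rng = random.Random(0)
toy = []
for _ in range(40):
    x = [rng.gauss(0.0, 1.0) for _ in range(5)]
    toy.append((x, 1 if x[0] > 0 else 0))  # only feature 0 carries signal
mean, sigma = variational_medn(toy, n_labels=2, C=0.05, lam=4.0, T=4)
print(mean)
print(sigma)
```

Any exact M$^3$N solver could be substituted for `solve_step1`; the outer loop only needs the posterior mean returned by Step 1 and the current diagonal of $\Sigma$.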



6. Generalization Bound 



The PAC-Bayes theory for averaging classifiers (Langford et al., 2001) provides a theoretical
motivation to learn an averaging model for classification. In this section, we extend the
classic PAC-Bayes theory on binary classifiers to MaxEnDNet, and analyze the generalization
performance of the structured prediction rule $h_1$ in Eq. (2). In order to prove an error
bound for $h_1$, the following mild assumption on the boundedness of the discriminant function
$F(\,\cdot\,; w)$ is necessary, i.e., there exists a positive constant $c$, such that,

$$
\forall w,\ F(\,\cdot\,; w) \in \mathcal{H}: \mathcal{X} \times \mathcal{Y} \to [-c, c].
$$

Recall that the averaging structured prediction function under the MaxEnDNet is defined
as $h(x, y) = \langle F(x, y; w) \rangle_{p(w)}$. Let us define the predictive margin of an instance $(x, y)$
under a function $h$ as $M(h, x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$. Clearly, $h$ makes a wrong
prediction on $(x, y)$ only if $M(h, x, y) \le 0$. Let $Q$ denote a distribution over $\mathcal{X} \times \mathcal{Y}$, and
let $\mathcal{D}$ represent a sample of $N$ instances randomly drawn from $Q$. With these definitions,
we have the following structured version of the PAC-Bayes theorem.

Theorem 8 (PAC-Bayes Bound of MaxEnDNet) Let $p_0$ be any continuous probability
distribution over $\mathcal{H}$ and let $\delta \in (0, 1)$. If $F(\,\cdot\,; w) \in \mathcal{H}$ is bounded by $\pm c$ as above, then
with probability at least $1 - \delta$, for a random sample $\mathcal{D}$ of $N$ instances from $Q$, for every
distribution $p$ over $\mathcal{H}$, and for all margin thresholds $\gamma > 0$:

$$
\Pr_Q(M(h, x, y) \le 0) \le \Pr_{\mathcal{D}}(M(h, x, y) \le \gamma) + O\left(\sqrt{\frac{\gamma^{-2}\, KL(p \| p_0) \ln(N|\mathcal{Y}|) + \ln N + \ln \delta^{-1}}{N}}\right),
$$

where $\Pr_Q(\cdot)$ and $\Pr_{\mathcal{D}}(\cdot)$ represent the probabilities of events over the true distribution $Q$,
and over the empirical distribution of $\mathcal{D}$, respectively.

The proof of Theorem 8 follows the same spirit as the proof of the original PAC-Bayes
bound, but with a number of technical extensions dealing with structured outputs and
margins. See Appendix B.5 for the details.
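The empirical term $\Pr_{\mathcal{D}}(M(h, x, y) \le \gamma)$ of the bound is simple to compute once the averaged scores $h(x, y)$ are available. The sketch below assumes the candidate outputs can be enumerated and stored in a dictionary, which is an illustrative simplification for small output spaces; the function names and example scores are hypothetical.

```python
def predictive_margin(scores, y_true):
    """M(h, x, y) = h(x, y) - max over y' != y of h(x, y').

    scores: dict mapping each candidate output to its averaged score h(x, y).
    """
    best_other = max(v for y, v in scores.items() if y != y_true)
    return scores[y_true] - best_other

def empirical_margin_error(score_list, labels, gamma):
    """Fraction of instances whose predictive margin is at most gamma."""
    hits = sum(1 for s, y in zip(score_list, labels) if predictive_margin(s, y) <= gamma)
    return hits / len(labels)

# Two made-up instances with two candidate outputs each.
score_list = [{"a": 2.0, "b": 0.5}, {"a": 1.0, "b": 1.2}]
labels = ["a", "a"]
print(empirical_margin_error(score_list, labels, 0.0))
print(empirical_margin_error(score_list, labels, 2.0))
```

Raising $\gamma$ increases this empirical term while shrinking the complexity term through $\gamma^{-2}$, which is the trade-off the bound expresses.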

Recently, McAllester (2007) presented a stochastic max-margin structured prediction
model, which is different from the averaging predictor under the MaxEnDNet model,
by designing a "posterior" distribution from which a model is sampled to make predictions.
A PAC-Bayes bound with an improved dependence on $|\mathcal{Y}|$ was shown in this model.
Langford and Shawe-Taylor (2003) show an interesting connection between the PAC-Bayes
bounds for averaging classifiers and stochastic classifiers, again by designing a posterior
distribution. But our posterior distribution is solved with MaxEnDNet and is generally
different from those designed in (McAllester, 2007) and (Langford and Shawe-Taylor, 2003).



7. Experiments 



In this section, we present empirical evaluations of the proposed Laplace MaxEnDNet
(LapMEDN) on both synthetic and real data sets. We compare LapMEDN with M$^3$N
(i.e., the Gaussian MaxEnDNet), $L_1$-regularized M$^3$N ($L_1$-M$^3$N), CRFs, $L_1$-regularized
CRFs ($L_1$-CRFs), and $L_2$-regularized CRFs ($L_2$-CRFs). We use the quasi-Newton method
(Liu and Nocedal, 1989) and its variant (Andrew and Gao, 2007) to solve the optimization
problems of CRFs, $L_1$-CRFs, and $L_2$-CRFs. For M$^3$N and LapMEDN, we use the sub-gradient
method (Ratliff et al., 2007) to solve the corresponding primal problem. To the
best of our knowledge, no formal description, implementation, and evaluation of the
$L_1$-M$^3$N exist in the literature; therefore how to solve $L_1$-M$^3$N remains an open problem, and
for comparison purposes we had to develop this model and algorithm anew. Details of our
work along this line deserve a more thorough presentation, which is beyond the scope of
this paper and will appear elsewhere. But briefly, for our experiments on synthetic data,
we implemented the constraint generating method (Tsochantaridis et al., 2004), which uses
MOSEK to solve an equivalent LP re-formulation of $L_1$-M$^3$N. However, this approach is
extremely slow on larger problems; therefore on real data we instead applied the sub-gradient
method (Ratliff et al., 2007) with a projection to an $L_1$-ball (Duchi et al., 2008) to solve
the larger $L_1$-M$^3$N based on the equivalent re-formulation with an $L_1$-norm constraint (i.e.,
the second formulation in Appendix A).
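The projection step used with the sub-gradient method can be illustrated with a sort-based Euclidean projection onto an $L_1$-ball. This is a minimal sketch of a standard sorting approach in the spirit of the cited projection algorithm, not the authors' code; the function name and the radius argument `z` are assumptions.

```python
import math

def project_l1_ball(v, z=1.0):
    """Euclidean projection of the vector v onto {u : ||u||_1 <= z}."""
    if sum(abs(a) for a in v) <= z:
        return list(v)  # already inside the ball
    u = sorted((abs(a) for a in v), reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - z) / j
        if uj - candidate > 0.0:
            theta = candidate  # keep the threshold of the last index that stays positive
    # Soft-threshold the magnitudes by theta and restore the signs.
    return [math.copysign(max(abs(a) - theta, 0.0), a) for a in v]

print(project_l1_ball([3.0, 1.0], 2.0))
print(project_l1_ball([-3.0, 1.0, 0.5], 2.0))
```

In a projected sub-gradient loop this function would be applied to the weight vector after each gradient step, which is also what sets small coordinates exactly to zero.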



7.1 Evaluation on Synthetic Data 

We first evaluate all the competing models on synthetic data where the true structured
predictions are known. Here, we consider sequence data, i.e., each input $x$ is a sequence
$(x_1, \ldots, x_L)$, and each component $x_l$ is a $d$-dimensional vector of input features. The
synthetic data are generated from pre-specified conditional random field models with either
i.i.d. instantiations of the input features (i.e., elements in the $d$-dimensional feature vectors)
or correlated (i.e., structured) instantiations of the input features, from which samples
of the structured output $y$, i.e., a sequence $(y_1, \ldots, y_L)$, can be drawn from the conditional
distribution $p(y|x)$ defined by the CRF based on a Gibbs sampler.



7.1.1 I.I.D. INPUT FEATURES 

The first experiment is conducted on synthetic sequence data with 100 i.i.d. input features
(i.e., $d = 100$). We generate three types of data sets with 10, 30, and 50 relevant input
features, respectively. For each type, we randomly generate 10 linear-chain CRFs with 8
binary labeling states (i.e., $L = 8$ and $\mathcal{Y}_l = \{0, 1\}$). The feature functions include: a
real-valued state-feature function over a one-dimensional input feature and a class label; and
4 ($2 \times 2$) binary transition feature functions capturing pairwise label dependencies. For
each model we generate a data set of 1000 samples. For each sample, we first independently
draw the 100 input features from a standard normal distribution, and then apply a Gibbs
sampler (based on the conditional distribution of the generated CRFs) to assign a labeling
sequence with 5000 iterations.
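The labeling step described above can be sketched as a Gibbs sampler over a binary linear chain. The code is illustrative only: how the CRF weights were drawn is not stated in the text, so the standard normal weights, the choice of the first 10 features as relevant, the reduced number of sweeps, and the clamping of the log-odds (a numerical guard) are all assumptions.

```python
import math
import random

def gibbs_sample_labels(x, w_state, w_trans, n_sweeps, rng):
    """Sample a binary label sequence from a linear-chain CRF by Gibbs sampling.

    x: list of L feature vectors; w_state[label][j]: state-feature weights;
    w_trans[a][b]: weight of the transition feature for adjacent labels (a, b).
    """
    L = len(x)
    y = [rng.randrange(2) for _ in range(L)]
    for _ in range(n_sweeps):
        for l in range(L):
            score = [0.0, 0.0]
            for label in (0, 1):
                s = sum(wj * xj for wj, xj in zip(w_state[label], x[l]))
                if l > 0:
                    s += w_trans[y[l - 1]][label]
                if l < L - 1:
                    s += w_trans[label][y[l + 1]]
                score[label] = s
            # Conditional probability of label 1 given the neighbours; clamp to avoid overflow.
            diff = max(-50.0, min(50.0, score[1] - score[0]))
            p_one = 1.0 / (1.0 + math.exp(-diff))
            y[l] = 1 if rng.random() < p_one else 0
    return y

rng = random.Random(0)
d, L = 100, 8
w_state = [[rng.gauss(0.0, 1.0) if j < 10 else 0.0 for j in range(d)] for _ in range(2)]
w_trans = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
y = gibbs_sample_labels(x, w_state, w_trans, n_sweeps=50, rng=rng)
print(y)
```

With 100 input features and two labels this construction has 200 state-feature weights and 4 transition weights, matching the counts given in the text.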

For each data set, we randomly draw a subset as training data and use the rest for testing.
The sizes of the training set are 30, 50, 80, 100, and 150. The QP problem in M$^3$N and the first
step of LapMEDN are solved with the exponentiated gradient method (Bartlett et al., 2004).
In all the following experiments, the regularization constants of $L_1$-CRFs and $L_2$-CRFs are
chosen from $\{0.01, 0.1, 1, 4, 9, 16\}$ by 5-fold cross-validation during training. For the
LapMEDN, we use the same method to choose $\lambda$ from 20 roughly evenly spaced values
between 1 and 268. For each setting, a performance score is computed from the average
over 10 random samples of data sets.

The results are shown in Figure 3. All the results of the LapMEDN are achieved with
3 iterations of the variational learning algorithm. From the results, we can see that under
different settings LapMEDN consistently outperforms M$^3$N and performs comparably with
$L_1$-CRFs and $L_1$-M$^3$N, both of which encourage a sparse estimate; and both the $L_1$-CRFs
and $L_2$-CRFs outperform the un-regularized CRFs, especially in the cases where the number
of training data is small. One interesting result is that the M$^3$N and $L_2$-CRFs perform
comparably. This is reasonable because, as derived by Lebanon and Lafferty (2001) and
noted by Globerson et al. (2007), the $L_2$-regularized maximum likelihood estimation
of CRFs has a similar convex dual as that of the M$^3$N, and the only difference is the loss
they try to optimize, i.e., CRFs optimize the log-loss while M$^3$N optimizes the hinge-loss.
Another interesting observation is that when there are very few relevant features, $L_1$-M$^3$N
performs the best (slightly better than LapMEDN); but as the number of relevant features
increases LapMEDN performs slightly better than the $L_1$-M$^3$N. Finally, as the number of
training data increases, all the algorithms consistently achieve better performance.

Figure 3: Evaluation results on data sets with i.i.d. features.

Figure 4: Results on data sets with 30 relevant features.

7.1.2 Correlated input features 

In reality, most data sets contain redundancies and the input features are usually correlated.
So, we evaluate our models on synthetic data sets with correlated input features. We follow
a similar procedure as in generating the data sets with i.i.d. features to first generate
10 linear-chain CRF models. Then, each CRF is used to generate a data set that contains
1000 instances, each with 100 input features of which 30 are relevant to the output. The
30 relevant input features are partitioned into 10 groups. For the features in each group,
we first draw a real value from a standard normal distribution and then corrupt the feature
with a random Gaussian noise to get 3 correlated features. The noise Gaussian has a
zero mean and standard variance 0.05. Here and in all the remaining experiments, we use
the sub-gradient method (Ratliff et al., 2007) to solve the QP problem in both M$^3$N and
the variational learning algorithm of LapMEDN. We use the learning rate and complexity
constant that are suggested by the authors, that is, $\alpha_t = \frac{1}{2\beta\sqrt{t}}$ and $C = 200\beta$, where $\beta$ is
a parameter we introduced to adjust $\alpha_t$ and $C$. We do $K$-fold CV on each data set and
take the average over the 10 data sets as the final results. Like (Taskar et al., 2003), in
each run we choose one part to do training and test on the remaining $K - 1$ parts. We vary $K$ from
20, 10, 7, 5, to 4. In other words, we use 50, 100, about 150, 200, and 250 samples during
training. We use the same grid search to choose $\lambda$ and $\beta$ from $\{9, 16, 25, 36, 49, 64\}$ and
$\{1, 10, 20, 30, 40, 50, 60\}$, respectively. Results are shown in Figure 4. We can draw the same
conclusions as from the previous results.

Figure 5 shows the true weights of the corresponding 200 state feature functions in
the model that generates the first data set, and the average of estimated weights of these
features under all competing models fitted from the first data set. All the averages are
taken over 10-fold cross-validation. From the plots (2 to 7) of the average model weights,
we can see that: for the last 140 state feature functions, which correspond to the last 70
irrelevant features, their average weights under LapMEDN (averaged posterior means of $w$ in
this case), $L_1$-M$^3$N and $L_1$-CRFs are extremely small, while CRFs and $L_2$-CRFs can have
larger values; for the first 60 state feature functions, which correspond to the 30 relevant
features, the overall weight estimation under LapMEDN is similar to that of the sparse
$L_1$-CRFs and $L_1$-M$^3$N, but appears to exhibit more shrinkage. Noticeably, CRFs and $L_2$-CRFs
both have more feature functions with large average weights. Note that all the models have
quite different average weights from the model (see the first plot) that generates the data.
This is because we use a stochastic procedure (i.e., Gibbs sampler) to assign labels to the
generated data samples instead of using the labels that are predicted by the model that
generates the data. In fact, if we use the model that generates the data to do prediction
on its generated data, the error rate is about 0.5. Thus, the learned models, which get
lower error rates, are different from the model that generates the data. Figure 6 shows
the variances of the 100-dimensional input features (since the variances of the two feature
functions that correspond to the same input feature are the same, we collapse each pair
into one point) learned by LapMEDN. Again, the variances are the averages over 10-fold
cross-validation. From the plot, we can see that the LapMEDN can recover the correlation
among the features to some extent, e.g., for the first 30 correlated features, which are
relevant to the output, the features in the same group tend to have similar (average)
variances in LapMEDN, whereas there is no such correlation among all the other features.
From these observations in both Figures 5 and 6, we can conclude that LapMEDN can
reasonably recover the sparse structures in the input data.



7.2 Real-World OCR Data Set

The OCR data set is partitioned into 10 subsets for 10-fold CV as in (Taskar et al., 2003;
Ratliff et al., 2007). We randomly select $N$ samples from each fold and put them together
to do 10-fold CV. We vary $N$ from 100, 150, 200, to 250, and denote the selected data
sets by OCR100, OCR150, OCR200, and OCR250, respectively. On these data sets and
the web data as in Section 7.4, our implementation of the cutting plane method for
$L_1$-M$^3$N is extremely slow. The warm-start simplex method of MOSEK does not help either.
For example, if we stop the algorithm with 600 iterations on OCR100, then it will take
about 20 hours to finish the 10-fold CV. Even with more than 5 thousand constraints
in each training, the performance is still very bad (the error rate is about 0.45). Thus, we
turn to an approximate projected sub-gradient method to solve the $L_1$-M$^3$N by combining
the on-line subgradient method (Ratliff et al., 2007) and the efficient $L_1$-ball projection
algorithm (Duchi et al., 2008). The projected sub-gradient method does not work as well
as the cutting plane method on the synthetic data sets, which is why we use two different
methods.

Figure 5: From top to bottom, plot 1 shows the weights of the state feature functions in the
linear-chain CRF model from which the data are generated; plot 2 to plot 7 show
the average weights of the learned LapMEDN, M$^3$N, $L_1$-M$^3$N, CRFs, $L_2$-CRFs,
and $L_1$-CRFs over 10-fold CV, respectively.

Figure 6: The average variances of the features on the first data set by LapMEDN.

Figure 7: Evaluation results on OCR data set with different numbers of selected data.

For $\beta = 4$ on OCR100 and OCR150, $\beta = 2$ on OCR200 and OCR250, and $\lambda = 36$,
the results are shown in Figure 7. We can see that as the number of training instances
increases, all the algorithms get lower error rates and smaller variances. Generally, the
LapMEDN consistently outperforms all the other models. M$^3$N outperforms the standard,
non-regularized CRFs and the $L_1$-CRFs. Again, $L_2$-CRFs perform comparably with M$^3$N.
This is a bit surprising but still reasonable, given that their only difference lies in
the loss functions (Globerson et al., 2007), as we have stated. By examining the prediction
accuracy during learning, we can see obvious over-fitting in CRFs and $L_1$-CRFs, as
shown in Figure 8. In contrast, $L_2$-CRFs are very robust. This is because unlike the
synthetic data sets, features in real-world data are usually not completely irrelevant. In
this case, setting small weights to zero as in $L_1$-CRFs will hurt generalization ability and
also lead to instability with respect to regularization constants, as shown later. Instead, $L_2$-CRFs do
not set small weights to zero but shrink them towards zero as in the LapMEDN. The
non-regularized maximum likelihood estimation can easily lead to over-fitting too. For the two
sparse models, the results suggest the potential advantages of $L_1$-norm regularized M$^3$N,
which is consistently better than the $L_1$-CRFs. Furthermore, as we shall see later, $L_1$-M$^3$N
is more stable than the $L_1$-CRFs.

Figure 8: The error rates of CRF models on test data during the learning. For the left
plot, the horizontal axis is $\sqrt{1/\text{ratioLL}}$, where ratioLL is the relative change
ratio of the log-likelihood, and from left to right the change ratios are 1, 0.5, 0.4,
0.3, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.005, 0.004, 0.003, 0.002, 0.001, 0.0005,
0.0004, 0.0003, 0.0002, 0.0001, and 0.00005; for the right plot, the horizontal axis
is $\sqrt{1000/\text{negLL}}$, where negLL is the negative log-likelihood, and from left to
right negLL are 1000, 800, 700, 600, 500, 300, 100, 50, 30, 10, 5, 3, 1, 0.5, 0.3,
0.1, 0.05, 0.03, 0.01, 0.005, 0.003, and 0.002.

7.3 Sensitivity to Regularization Constants 

Figure 9 shows the error rates of the models in question on the data set OCR100 over
different magnitudes of the regularization constants. For M$^3$N, the regularization constant
is the parameter C, and for all the other models, the regularization constant is the parameter
$\lambda$. When $\lambda$ changes, the parameter C in LapMEDN and $L_1$-M$^3$N is fixed at 1.

From the results, we can see that the $L_1$-CRFs are quite sensitive to the regularization
constants. In contrast, $L_2$-CRFs, M$^3$N, $L_1$-M$^3$N and LapMEDN are much less sensitive.
LapMEDN and $L_1$-M$^3$N are the most stable models. The stability of LapMEDN is due
to posterior weighting, which replaces the hard-thresholding that sets small weights to zero
in the $L_1$-CRFs. One interesting observation is that the max-margin based $L_1$-M$^3$N is much
more stable than the $L_1$-norm regularized CRFs. One possible reason is that, like
LapMEDN, $L_1$-M$^3$N enjoys both primal and dual sparsity, which makes it less sensitive
to outliers, whereas the $L_1$-CRF is only primal sparse.

Figure 9: Error rates of different models on OCR100 with different regularization constants.
The regularization constant is the parameter C for M$^3$N, and for all the other
models, it is the parameter $\lambda$. From left to right, the regularization constants for
the two regularized CRFs (above plot) are 0.0001, 0.001, 0.01, 0.1, 1, 4, 9, 16,
and 25; for M$^3$N and LapMEDN, the regularization constants are $k^2$, $1 \le k \le 9$;
and for $L_1$-M$^3$N, the constants are $k^2$, $13 \le k \le 21$.



7.4 Real-World Web Data Extraction



The last experiments are conducted on another problem, real-world web data
extraction, as extensively studied in (Zhu et al., 2008a). Web data extraction is the task of
identifying information of interest from web pages. Each sample is a data record or an entire
web page, which is represented as a set of HTML elements. One striking characteristic of web
data extraction is that various types of structural dependencies between HTML elements
exist, e.g., the HTML tag tree or the Document Object Model (DOM) structure is itself
hierarchical. In (Zhu et al., 2008a), hierarchical CRFs are shown to have great promise and
achieve better performance than flat models like linear-chain CRFs (Lafferty et al., 2001).
One method to construct a hierarchical model is to first use a parser to construct a so-called
vision tree. Then, based on the vision tree, a hierarchical model can be constructed
accordingly to extract the attributes of interest, e.g., a product's name, image, price,
description, etc. See (Zhu et al., 2008a) for an example of the vision tree and the
corresponding hierarchical model. In such a hierarchical extraction model, inner nodes are
useful to incorporate long distance dependencies, and the variables at one level are
refinements of the variables at upper levels.

In these experiments, we identify product items for sale on the Web. For each product
item, four attributes (Name, Image, Price, and Description) are extracted. We use the data
set that is built with web pages generated by 37 different templates (Zhu et al., 2008a). For
each template, there are 5 pages for training and 10 for testing. We evaluate all the methods
on the record level, that is, we assume that data records are given, and we compare different
models on the accuracy of extracting attributes in the given records. In the 185 training
pages, there are 1585 data records in total; in the 370 testing pages, 3391 data records
are collected. As for the evaluation criteria, we use two comprehensive measures, i.e.,
average F1 and block instance accuracy. As defined in (Zhu et al., 2008a), average F1 is
the average value of the F1 scores of the four attributes, and block instance accuracy is the
percent of data records whose Name, Image, and Price are all correctly identified.
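The two measures can be sketched in code as follows. This is an illustrative sketch only: the per-attribute F1 is assumed to be the usual harmonic mean of precision and recall, and the input layouts (`counts` as true-positive/false-positive/false-negative triples, `records` as per-attribute correctness flags) are hypothetical choices made for this example, not the evaluation scripts of (Zhu et al., 2008a).

```python
ATTRIBUTES = ("Name", "Image", "Price", "Description")
BLOCK_ATTRIBUTES = ("Name", "Image", "Price")  # Description is not required for a block hit


def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2.0 * precision * recall / (precision + recall)


def average_f1(counts):
    """Average the F1 scores of the four attributes.

    counts maps each attribute name to a (true_pos, false_pos, false_neg) triple.
    """
    return sum(f1_score(*counts[name]) for name in ATTRIBUTES) / len(ATTRIBUTES)


def block_instance_accuracy(records):
    """Fraction of records whose Name, Image, and Price are all correct.

    records is a list of dicts mapping attribute name to True/False.
    """
    hits = sum(all(record[name] for name in BLOCK_ATTRIBUTES) for record in records)
    return hits / len(records)
```

Block instance accuracy is the stricter of the two measures, since a single wrong attribute among the three makes the whole record count as a miss.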

We randomly select m = 5, 10, 15, 20, 30, 40, or 50 percent of the training records as
training data, and test on all the testing records. For each m, 10 independent experiments
were conducted, and the average performance is summarized in Figure 10. From the results,
we can see the following. First, the models (especially the max-margin models, i.e., M$^3$N,
$L_1$-M$^3$N, and LapMEDN) with regularization (i.e., $L_1$-norm, $L_2$-norm, or the entropic
regularization of LapMEDN) significantly outperform the un-regularized CRFs. Second, the
max-margin models generally outperform the conditional likelihood-based models (i.e., CRFs,
$L_2$-CRFs, and $L_1$-CRFs). Third, the LapMEDN performs comparably with the $L_1$-M$^3$N,
which enjoys both dual and primal sparsity as the LapMEDN does, and outperforms all other
models, especially when the number of training data is small. Finally, as in the previous
experiments on OCR data, the $L_1$-M$^3$N generally outperforms the $L_1$-CRFs, which suggests
the potential promise of the max-margin based $L_1$-M$^3$N. A detailed discussion and validation
of this new model ($L_1$-M$^3$N) is beyond the scope of this paper and will be deferred to a
later paper.

8. Related Work 

Our work is motivated by the maximum entropy discrimination (MED) method proposed
by (Jaakkola et al., 1999), which integrates SVM and entropic regularization to obtain an
averaging maximum margin model for classification. The MaxEnDNet model presented is
essentially a structured version of MED built on M$^3$N, the so-called "structured SVM".



7. These experiments are slightly different from those in (Zhu et al., 2008a). Here, we introduce more
general feature functions based on the content and visual features as in (Zhu et al., 2008a).

Figure 10: The average F1 values and block instance accuracy on web data extraction with
different numbers of training data.



As we presented in this paper, this extension leads to a substantially more flexible and
powerful new paradigm for structured discriminative learning and prediction, which enjoys
a number of advantages such as model averaging, primal and dual sparsity, and accommodation
of latent generative structures, but at the same time raises new algorithmic challenges
in inference and learning.

Related to our approach, a sparse Bayesian learning framework has been proposed to
find sparse and robust solutions to regression and classification. One example along this line
is the relevance vector machine (RVM) (Tipping, 2001). The RVM was proposed based on
SVM. But unlike SVM, which directly optimizes on the margins, RVM defines a likelihood
function from the margins, with a Gaussian distribution for regression and a logistic sigmoid
link function for classification, and then does type-II maximum likelihood estimation,
that is, RVM maximizes the marginal likelihood. Although called sparse Bayesian learning
(Figueiredo, 2001; Eyheramendy et al., 2003), as shown in (Kaban, 2007) the sparsity is
actually due to the MAP estimation. The similar ambiguity of RVM is justified in (Wipf et al.,
2003). Unlike these approaches, we adhere to a full Bayesian-style principle and learn a
distribution of predictive models by optimizing a generalized maximum entropy under a set
of expected margin constraints. By defining likelihood functions with margins, similar
Bayesian interpretations of both binary and multi-class SVM can also be found in (Sollich,
2002; Zhang and Jordan, 2006).



The hierarchical interpretation of the Laplace prior has been explored in a number of
contexts in the literature. Based on this interpretation, a Jeffreys non-informative
second-level hyper-prior was proposed in (Figueiredo, 2001), with an EM algorithm developed to
find the MAP estimate. The advantage of the Jeffreys prior is that it is parameter-free. But
as shown in (Eyheramendy et al., 2003; Kaban, 2007), usually no advantage is achieved by
using the Jeffreys hyper-prior over the Laplace prior. In (Tipping, 2001), a gamma hyper-prior
is used in place of the second-level exponential as in the hierarchical interpretation of
the Laplace prior.

To encourage sparsity in SVM, two strategies have been used. The first one is to replace
the $L_2$-norm by an $L_1$-norm of the weights (Bennett and Mangasarian, 1992; Zhu et al.,
2004). The second strategy is to explicitly add a cardinality constraint on the weights.
This leads to a hard non-convex optimization problem; thus relaxations must be applied
(Chan et al., 2007). Under the maximum entropy discrimination models, feature selection
was studied in (Jebara and Jaakkola, 2000) by introducing a set of structural variables. It
is straightforward to generalize it to the structured learning case, but the resultant learning
problem can be highly complex and approximations must be developed.

Although the parameter distribution p(w) in Theorem 2 has a similar form to that
of the Bayesian Conditional Random Fields (BCRFs) (Qi et al., 2005), MaxEnDNet is
fundamentally different from BCRFs, as we have stated. Dredze et al. (2008) present an
interesting confidence-weighted linear classification method, which automatically estimates
the mean and variance of model parameters in online learning. The procedure is similar to
(but indeed different from) our variational Bayesian method of Laplace MaxEnDNet.

Finally, some of the results shown in this paper appeared in the conference paper
(Zhu et al., 2008b).



9. Conclusions and Future Work 

To summarize, we have presented a general theory of maximum entropy discrimination
Markov networks for structured input/output learning and prediction. This formalism offers
a formal paradigm for integrating both generative and discriminative principles and the
Bayesian regularization techniques for learning structured prediction models. It subsumes
popular methods such as support vector machines, maximum entropy discrimination models
(Jaakkola et al., 1999), and maximum margin Markov networks as special cases, and
therefore inherits all the merits of these techniques.

The MaxEnDNet model offers a number of important advantages over conventional structured
prediction methods, including: 1) model averaging, which leads to a PAC-Bayesian
bound on generalization error; 2) entropic regularization over max-margin learning, which
can be leveraged to learn structured prediction models that are simultaneously primal
and dual sparse; and 3) latent structures underlying the structured input/output variables,
which enable better incorporation of domain knowledge in model design and semi-supervised
learning based on partially labeled data. In this paper, we have discussed in
detail the first two aspects, and the third aspect is explored in (Zhu et al., 2008c). We have
also shown that certain instantiations of the MaxEnDNet model, such as the LapMEDN
that achieves primal and dual sparsity, can be efficiently trained based on an iterative
optimization scheme that employs existing techniques such as the variational Bayes
approximation and the convex optimization procedures that solve the standard M$^3$N. We
demonstrated that on synthetic data the LapMEDN can recover the sparse model as well
as the sparse $L_1$-regularized MAP estimation, and on real data sets LapMEDN can achieve
superior performance.

Overall, we believe that the MaxEnDNet model can be extremely general and adaptive,
and it offers a promising new framework for building more flexible, generalizable, and
large-scale structured prediction models that enjoy the benefits of both generative and
discriminative modeling principles. While exploring novel instantiations of this model will
be an interesting direction to pursue, developing more efficient learning algorithms,
formulating tighter but easy-to-solve convex relaxations, and adapting this model to
challenging applications such as statistical machine translation and structured associations
of genome markers to complex disease traits could also lead to fruitful results.

Appendix A. $L_1$-M$^3$N and its Lagrange-Dual

Based on the $L_1$-norm regularized SVM (Zhu et al., 2004; Bennett and Mangasarian, 1992),
a straightforward formulation of $L_1$-M$^3$N is as follows,



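$$\min_{w,\xi}\ \frac{1}{2}\|w\| + C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t.}\quad w^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i;\ \ \xi_i \ge 0,\ \ \forall i,\ \forall y \ne y^i,$$

where $\|\cdot\|$ is the $L_1$-norm, $\Delta f_i(y) = f(x^i, y^i) - f(x^i, y)$, and $\Delta\ell_i(y)$ is a loss function.
Another equivalent formulation (see footnote 8) is as follows: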

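$$\min_{w,\xi}\ C\sum_{i=1}^{N}\xi_i$$

$$\text{s.t.}\quad \|w\| \le \lambda;\qquad w^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i;\ \ \xi_i \ge 0,\ \ \forall i,\ \forall y \ne y^i.$$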



To derive the convex dual problem, we introduce a dual variable $\alpha_i(y)$ for each
constraint in the former formulation and form the Lagrangian as follows,

$$L(\alpha, w, \xi) = \frac{1}{2}\|w\| + C\sum_{i=1}^{N}\xi_i - \sum_{i,\,y \ne y^i} \alpha_i(y)\big(w^\top \Delta f_i(y) - \Delta\ell_i(y) + \xi_i\big).$$



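8. See (Taskar et al., 2006) for the transformation technique.

By definition, the Lagrangian dual is,

$$L^*(\alpha) = \inf_{w,\xi} L(\alpha, w, \xi)$$

$$= \inf_{w}\Big[\frac{1}{2}\|w\| - \sum_{i,\,y \ne y^i} \alpha_i(y)\, w^\top \Delta f_i(y)\Big] + \inf_{\xi}\Big[C\sum_{i=1}^{N}\xi_i - \sum_{i,\,y \ne y^i} \alpha_i(y)\,\xi_i\Big] + \ell$$

$$= -\sup_{w}\Big[w^\top\Big(\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta f_i(y)\Big) - \frac{1}{2}\|w\|\Big] - \sup_{\xi}\Big[\sum_{i,\,y \ne y^i} \alpha_i(y)\,\xi_i - C\sum_{i=1}^{N}\xi_i\Big] + \ell,$$

where $\ell = \sum_{i,\,y \ne y^i} \alpha_i(y)\Delta\ell_i(y)$.

Again, by definition, the first term on the right-hand side is the convex conjugate of
$\phi(w) = \frac{1}{2}\|w\|$ and the second term is the conjugate of $U(\xi) = C\sum_{i=1}^{N}\xi_i$. It is easy to show
that,

$$\phi^*(\alpha) = I_\infty\Big(\Big|\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta f_i^k(y)\Big| \le \frac{1}{2},\ \forall 1 \le k \le K\Big),$$

and

$$U^*(\alpha) = I_\infty\Big(\sum_{y \ne y^i} \alpha_i(y) \le C,\ \forall i\Big),$$

where, as defined before, $I_\infty(\cdot)$ is an indicator function that equals zero when its argument
is true and infinity otherwise, and $\Delta f_i^k(y) = f_k(x^i, y^i) - f_k(x^i, y)$.
Therefore, we get the dual problem as follows,

$$\max_{\alpha}\ \sum_{i,\,y \ne y^i} \alpha_i(y)\Delta\ell_i(y)$$

$$\text{s.t.}\quad \Big|\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta f_i^k(y)\Big| \le \frac{1}{2},\ \forall k;\qquad \sum_{y \ne y^i} \alpha_i(y) \le C,\ \forall i.$$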
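The two conjugates used in Appendix A can be worked out coordinate by coordinate. The derivation below is a sketch; the shorthand $\mu = \sum_{i, y \ne y^i} \alpha_i(y)\Delta f_i(y)$ and $a_i = \sum_{y \ne y^i}\alpha_i(y)$ is introduced here for readability and is not notation from the paper.

```latex
% Conjugate of phi(w) = (1/2)||w||_1: the supremum separates over coordinates.
\sup_{w}\Big( w^\top \mu - \tfrac{1}{2}\|w\|_1 \Big)
  = \sum_{k=1}^{K} \sup_{w_k}\Big( w_k \mu_k - \tfrac{1}{2}|w_k| \Big),
\qquad
\sup_{w_k}\Big( w_k \mu_k - \tfrac{1}{2}|w_k| \Big) =
\begin{cases}
0 & \text{if } |\mu_k| \le \tfrac{1}{2} \quad (\text{attained at } w_k = 0),\\[2pt]
+\infty & \text{if } |\mu_k| > \tfrac{1}{2} \quad (w_k = t\,\mathrm{sign}(\mu_k),\ t \to \infty).
\end{cases}

% Conjugate of U(xi) = C sum_i xi_i over the feasible slacks xi_i >= 0.
\sup_{\xi \ge 0} \sum_{i=1}^{N} \xi_i\,(a_i - C) =
\begin{cases}
0 & \text{if } a_i \le C \text{ for all } i,\\[2pt]
+\infty & \text{otherwise.}
\end{cases}
```

Both suprema are therefore either zero or infinite, which is why they appear in the dual only as the box constraint on each feature and the upper bound $C$ on the dual mass of each example.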



Appendix B. Proofs of Theorems and Corollaries 
Appendix B.1. Proof of Theorem 2

Proof As we have stated, P1 is a convex program and satisfies Slater's condition.
To compute its convex dual, we introduce a non-negative dual variable $\alpha_i(y)$ for each
constraint in $\mathcal{F}_1$ and another non-negative dual variable c for the normalization constraint
$\int p(w)\,dw = 1$. This gives rise to the following Lagrangian:

$$L(p(w), \xi, \alpha, c) = KL(p(w)\|p_0(w)) + U(\xi) + c\Big(\int p(w)\,dw - 1\Big) - \sum_{i,\,y \ne y^i} \alpha_i(y)\Big(\int p(w)\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\,dw + \xi_i\Big).$$

The Lagrangian dual function is defined as $L^*(\alpha, c) = \inf_{p(w),\,\xi} L(p(w), \xi, \alpha, c)$. Taking the
derivative of L w.r.t. p(w), we get,

$$\frac{\partial L}{\partial p(w)} = 1 + c + \log\frac{p(w)}{p_0(w)} - \sum_{i,\,y \ne y^i} \alpha_i(y)\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big].$$

Setting the derivative to zero, we get the following expression of distribution p(w),

$$p(w) = \frac{1}{Z(\alpha)}\, p_0(w) \exp\Big\{\sum_{i,\,y \ne y^i} \alpha_i(y)\big[\Delta F_i(y; w) - \Delta\ell_i(y)\big]\Big\},$$

where $Z(\alpha) = \int p_0(w) \exp\{\sum_{i,\,y \ne y^i} \alpha_i(y)[\Delta F_i(y; w) - \Delta\ell_i(y)]\}\,dw$ is a normalization
constant and $c = -1 + \log Z(\alpha)$.

Substituting p(w) into $L^*$, we obtain,

$$L^*(\alpha, c) = \inf_{\xi}\Big(-\log Z(\alpha) + U(\xi) - \sum_{i,\,y \ne y^i} \alpha_i(y)\,\xi_i\Big)$$

$$= -\log Z(\alpha) + \inf_{\xi}\Big(U(\xi) - \sum_{i,\,y \ne y^i} \alpha_i(y)\,\xi_i\Big)$$

$$= -\log Z(\alpha) - \sup_{\xi}\Big(\sum_{i,\,y \ne y^i} \alpha_i(y)\,\xi_i - U(\xi)\Big)$$

$$= -\log Z(\alpha) - U^*(\alpha),$$

which is the objective in the dual problem D1. The $\{\alpha_i(y)\}$ derived from D1 lead to the
optimum p(w) according to Eq. (3). ■



Appendix B.2. Proof of Theorem 3

Proof Replacing $p_0(w)$ and $\Delta F_i(y; w)$ in Eq. (3) with $\mathcal{N}(w|0, I)$ and $w^\top \Delta f_i(y)$
respectively, we can obtain the following closed-form expression of the $Z(\alpha)$ in p(w):

$$Z(\alpha) \triangleq \int \mathcal{N}(w|0, I) \exp\Big\{\sum_{i,\,y \ne y^i} \alpha_i(y)\big[w^\top \Delta f_i(y) - \Delta\ell_i(y)\big]\Big\}\,dw$$

$$= \int (2\pi)^{-\frac{K}{2}} \exp\Big\{-\frac{1}{2} w^\top w + \sum_{i,\,y \ne y^i} \alpha_i(y)\big[w^\top \Delta f_i(y) - \Delta\ell_i(y)\big]\Big\}\,dw$$

$$= \exp\Big(-\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta\ell_i(y) + \frac{1}{2}\Big\|\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta f_i(y)\Big\|^2\Big).$$

Substituting the normalization factor into the general dual problem D1, we get the dual
problem of Gaussian MaxEnDNet. As we have stated, the constraints $\sum_{y \ne y^i} \alpha_i(y) = C$
are due to the conjugate of $U(\xi) = C\sum_i \xi_i$.

For prediction, again replacing $p_0(w)$ and $\Delta F_i(y; w)$ in Eq. (3) with $\mathcal{N}(w|0, I)$ and
$w^\top \Delta f_i(y)$ respectively, we can get $p(w) = \mathcal{N}(w|\mu, I)$, where $\mu = \sum_{i,\,y \ne y^i} \alpha_i(y)\Delta f_i(y)$.
Substituting p(w) into the predictive function $h_1$, we can get
$h_1(x) = \arg\max_{y \in \mathcal{Y}(x)} \mu^\top f(x, y) = \arg\max_{y \in \mathcal{Y}(x)} \big(\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta f_i(y)\big)^\top f(x, y)$,
which is identical to the prediction rule of the standard M$^3$N (Taskar et al., 2003)
because the dual parameters are achieved by solving the same dual problem. ■
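The closed form of the Gaussian normalization factor in the proof of Theorem 3 can be sanity-checked numerically in one dimension. The sketch below is only an illustration: the scalar `a` stands in for the vector $\sum_{i,y \ne y^i}\alpha_i(y)\Delta f_i(y)$ and `b` for $\sum_{i,y \ne y^i}\alpha_i(y)\Delta\ell_i(y)$, and both names, as well as the midpoint-rule integrator, are assumptions made for this example.

```python
import math


def gaussian_partition_numeric(a, b, half_width=12.0, steps=20000):
    """Midpoint-rule estimate of the integral of N(w|0,1) * exp(a*w - b) over w."""
    lower = a - half_width  # the integrand is a Gaussian bump centred at w = a
    width = 2.0 * half_width / steps
    total = 0.0
    for n in range(steps):
        w = lower + (n + 0.5) * width
        total += math.exp(-0.5 * w * w + a * w - b)
    return total * width / math.sqrt(2.0 * math.pi)


def gaussian_partition_closed_form(a, b):
    """One-dimensional instance of Z = exp(-b + a^2 / 2)."""
    return math.exp(-b + 0.5 * a * a)


if __name__ == "__main__":
    a, b = 1.5, 0.7
    print(gaussian_partition_numeric(a, b), gaussian_partition_closed_form(a, b))
```

The agreement follows from completing the square, $-\frac{1}{2}w^2 + aw = -\frac{1}{2}(w - a)^2 + \frac{1}{2}a^2$, which also shows why the posterior is a unit-variance Gaussian whose mean is shifted to `a`.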



Appendix B.3. Proof of Corollary 4

Proof Suppose $(p^*(w), \xi^*)$ is the optimal solution of P1, then we have: for any $(p(w), \xi)$,
$p(w) \in \mathcal{F}_1$ and $\xi \ge 0$,

$$KL(p^*(w)\|p_0(w)) + U(\xi^*) \le KL(p(w)\|p_0(w)) + U(\xi).$$

From Theorem 3, we conclude that the optimum predictive parameter distribution is
$p^*(w) = \mathcal{N}(w|\mu^*, I)$. Since $p_0(w)$ is also normal, for any distribution $p(w) = \mathcal{N}(w|\mu, I)$ (see footnote 9),
with several steps of algebra it is easy to show that $KL(p(w)\|p_0(w)) = \frac{1}{2}\mu^\top\mu$. Thus, we
can get: for any $(\mu, \xi) \in \{(\mu, \xi) : \mu^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i,\ \forall i,\ \forall y \ne y^i\}$ and $\xi \ge 0$,

$$\frac{1}{2}(\mu^*)^\top \mu^* + U(\xi^*) \le \frac{1}{2}\mu^\top\mu + U(\xi),$$

which means the mean of the optimum posterior distribution under a Gaussian MaxEnDNet
is achieved by solving a primal problem as stated in the Corollary. ■
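The "several steps of algebra" behind the KL-divergence between two identity-covariance Gaussians can be written out as follows, with the expectation taken under $p(w) = \mathcal{N}(w|\mu, I)$ and $p_0(w) = \mathcal{N}(w|0, I)$:

```latex
KL\big(p(w)\,\|\,p_0(w)\big)
  = \Big\langle \log \frac{p(w)}{p_0(w)} \Big\rangle_{p}
  = \Big\langle -\tfrac{1}{2}\|w - \mu\|^2 + \tfrac{1}{2}\|w\|^2 \Big\rangle_{p}
  = \Big\langle w^\top \mu - \tfrac{1}{2}\mu^\top\mu \Big\rangle_{p}
  = \mu^\top\mu - \tfrac{1}{2}\mu^\top\mu
  = \tfrac{1}{2}\mu^\top\mu .
```

The normalization constants cancel because both distributions share the same covariance, so only the shift of the mean contributes.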



Appendix B.4. Proof of Corollary 6

Proof The proof follows the same structure as the above proof of Corollary 4. Here, we
only present the derivation of the KL-divergence under the Laplace MaxEnDNet.

Theorem 2 shows that the general posterior distribution is
$p(w) = \frac{1}{Z(\alpha)}\, p_0(w) \exp\big(w^\top \eta - \sum_{i,\,y \ne y^i} \alpha_i(y)\Delta\ell_i(y)\big)$ and
$Z(\alpha) = \exp\big(-\sum_{i,\,y \ne y^i} \alpha_i(y)\Delta\ell_i(y)\big) \prod_{k=1}^{K} \frac{\lambda}{\lambda - \eta_k^2}$ for the Laplace
MaxEnDNet as shown in Eq. (5). Using the definition of KL-divergence, we get:

$$KL(p(w)\|p_0(w)) = \langle w \rangle_p^\top \eta - \sum_{k=1}^{K} \log\frac{\lambda}{\lambda - \eta_k^2} = \sum_{k=1}^{K} \mu_k \eta_k - \sum_{k=1}^{K} \log\frac{\lambda}{\lambda - \eta_k^2}.$$

Corollary 7 shows that $\mu_k = \frac{2\eta_k}{\lambda - \eta_k^2}$, $\forall 1 \le k \le K$. Thus, we get
$\frac{\lambda}{\lambda - \eta_k^2} = \frac{\lambda\mu_k}{2\eta_k}$ and a set
of equations: $\mu_k \eta_k^2 + 2\eta_k - \lambda\mu_k = 0$, $\forall 1 \le k \le K$. To solve these equations, we consider
two cases. First, if $\mu_k = 0$, then $\eta_k = 0$. Second, if $\mu_k \ne 0$, then we can solve the quadratic
equation to get $\eta_k$: $\eta_k = \frac{-1 \pm \sqrt{1 + \lambda\mu_k^2}}{\mu_k}$. The second solution includes the first one since we
can show that when $\mu_k \to 0$, $\frac{-1 + \sqrt{1 + \lambda\mu_k^2}}{\mu_k} \to 0$ by using L'Hospital's rule. Thus, we get:

$$\mu_k \eta_k = -1 \pm \sqrt{\lambda\mu_k^2 + 1}.$$

Since $\eta_k^2 < \lambda$ (otherwise the problem is not bounded), $\mu_k \eta_k$ is always non-negative. Thus,
only the solution $\mu_k \eta_k = -1 + \sqrt{1 + \lambda\mu_k^2}$ is feasible. So, we get:

$$\frac{\lambda}{\lambda - \eta_k^2} = \frac{\lambda\mu_k^2}{2\big(\sqrt{\lambda\mu_k^2 + 1} - 1\big)} = \frac{\sqrt{\lambda\mu_k^2 + 1} + 1}{2},$$

and

$$KL(p(w)\|p_0(w)) = \sum_{k=1}^{K}\Big(\sqrt{\lambda\mu_k^2 + 1} - \log\frac{\sqrt{\lambda\mu_k^2 + 1} + 1}{2}\Big) - K$$

$$= \sqrt{\lambda}\sum_{k=1}^{K}\Big(\sqrt{\mu_k^2 + \frac{1}{\lambda}} - \frac{1}{\sqrt{\lambda}}\log\frac{\sqrt{\lambda\mu_k^2 + 1} + 1}{2}\Big) - K.$$

Applying the same arguments as in the above proof of Corollary 4 and using the above
result of the KL-divergence, we get the problem in Corollary 6, where the constant $-K$ is
ignored. The margin constraints defined with the mean $\mu$ are due to the linearity assumption
of the discriminant functions. ■

9. Although $\mathcal{F}_1$ is much richer than the set of normal distributions with an identity covariance matrix,
Theorem 3 shows that the solution is a restricted normal distribution. Thus, it suffices to consider only
these normal distributions in order to learn the mean of the optimum distribution. The similar argument
applies to the proof of Corollary 6.
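The algebra in this proof can be checked numerically. The sketch below assumes the relations as reconstructed from the proof, namely $\mu_k = 2\eta_k / (\lambda - \eta_k^2)$, the feasible root $\mu_k\eta_k = -1 + \sqrt{1 + \lambda\mu_k^2}$, and a per-coordinate log-partition term $\log\frac{\lambda}{\lambda - \eta_k^2}$; the function names are hypothetical and chosen for this example only.

```python
import math


def eta_from_mu(mu_k, lam):
    """Feasible root of mu_k * eta^2 + 2 * eta - lam * mu_k = 0."""
    if mu_k == 0.0:
        return 0.0
    return (-1.0 + math.sqrt(1.0 + lam * mu_k * mu_k)) / mu_k


def kl_via_eta(mu, lam):
    """KL written as sum_k mu_k * eta_k - log(lam / (lam - eta_k^2))."""
    total = 0.0
    for mu_k in mu:
        eta_k = eta_from_mu(mu_k, lam)
        total += mu_k * eta_k - math.log(lam / (lam - eta_k * eta_k))
    return total


def kl_closed_form(mu, lam):
    """KL written only in terms of the posterior mean mu, including the -K constant."""
    total = 0.0
    for mu_k in mu:
        root = math.sqrt(lam * mu_k * mu_k + 1.0)
        total += root - math.log((root + 1.0) / 2.0)
    return total - len(mu)


if __name__ == "__main__":
    mu, lam = [0.0, 0.3, -2.0], 4.0
    print(kl_via_eta(mu, lam), kl_closed_form(mu, lam))
```

Both expressions vanish at $\mu = 0$ and agree elsewhere, which is consistent with the closed form acting as a regularizer on the posterior mean that is minimized at the origin.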



Appendix B.5. Proof of Theorem 8

We follow the same structure as the proof of the PAC-Bayes bound for binary classifiers
(Langford et al., 2001) and employ a similar technique to generalize to multi-class problems
as in (Schapire et al., 1998). Recall that the output space is $\mathcal{Y}$, and the base discriminant
function is $F(\,\cdot\,; w) \in \mathcal{H} : \mathcal{X} \times \mathcal{Y} \to [-c, c]$, where $c > 0$ is a constant. Our averaging
model is specified by $h(x, y) = \langle F(x, y; w) \rangle_{p(w)}$. We define the margin of an example
$(x, y)$ for such a function h as,

$$M(h, x, y) = h(x, y) - \max_{y' \ne y} h(x, y'). \qquad (13)$$

Thus, the model h makes a wrong prediction on $(x, y)$ only if $M(h, x, y) \le 0$. Let Q be a
distribution over $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{D}$ be a sample of N examples independently and randomly
drawn from Q. With these definitions, we have the PAC-Bayes theorem. For easy reading,
we copy the theorem in the following:

Theorem 8 (PAC-Bayes Bound of MaxEnDNet) Let $p_0$ be any continuous probability
distribution over $\mathcal{H}$ and let $\delta \in (0, 1)$. If $F(\,\cdot\,; w) \in \mathcal{H}$ is bounded by $\pm c$ as above,
then with probability at least $1 - \delta$, for a random sample $\mathcal{D}$ of N instances from Q, for every
distribution p over $\mathcal{H}$, and for all margin thresholds $\gamma > 0$:

$$\Pr_Q(M(h, x, y) \le 0) \le \Pr_{\mathcal{D}}(M(h, x, y) \le \gamma) + O\Bigg(\sqrt{\frac{\gamma^{-2} KL(p\|p_0)\ln(N|\mathcal{Y}|) + \ln N + \ln\delta^{-1}}{N}}\Bigg),$$

where $\Pr_Q(\cdot)$ and $\Pr_{\mathcal{D}}(\cdot)$ represent the probabilities of events over the true distribution Q,
and over the empirical distribution of $\mathcal{D}$, respectively.

Proof Let m be any natural number. For every distribution p, we independently draw m
base models (i.e., discriminant functions) $F_i \sim p$ at random. We also independently draw
m variables $\mu_i \sim U([-c, c])$, where U denotes the uniform distribution. We define the binary
functions $g_i : \mathcal{X} \times \mathcal{Y} \to \{-c, +c\}$ by:

$$g_i(x, y; F_i, \mu_i) = 2c\, I(\mu_i < F_i(x, y)) - c.$$

With the $F_i$, $\mu_i$, and $g_i$, we define $\mathcal{H}_m$ as,

$$\mathcal{H}_m = \Big\{ f : (x, y) \mapsto \frac{1}{m}\sum_{i=1}^{m} g_i(x, y; F_i, \mu_i) \ \Big|\ F_i \in \mathcal{H},\ \mu_i \in [-c, c] \Big\}.$$

We denote the distribution of f over the set $\mathcal{H}_m$ by $p^m$. For a fixed pair $(x, y)$, the
quantities $g_i(x, y; F_i, \mu_i)$ are i.i.d. bounded random variables with the mean:

$$\langle g_i(x, y; F_i, \mu_i) \rangle_{F_i \sim p,\ \mu_i \sim U[-c, c]} = \big\langle (+c)\,p[\mu_i < F_i(x, y)\,|\,F_i] + (-c)\,p[\mu_i \ge F_i(x, y)\,|\,F_i] \big\rangle_{F_i \sim p}$$

$$= \Big\langle \frac{1}{2c}\,c\,(c + F_i(x, y)) - \frac{1}{2c}\,c\,(c - F_i(x, y)) \Big\rangle_{F_i \sim p}$$

$$= h(x, y).$$

Therefore, $\langle f(x, y) \rangle_{f \sim p^m} = h(x, y)$. Since $f(x, y)$ is the average over m i.i.d. bounded
variables, Hoeffding's inequality applies. Thus, for every $(x, y)$,

$$\Pr_{f \sim p^m}[f(x, y) - h(x, y) > \xi] \le e^{-\frac{m\xi^2}{2c^2}}.$$
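The randomized construction at the start of this proof can be simulated directly. In the sketch below the base discriminant value is held fixed at `f_value` (a point-mass choice of p made purely for illustration), each binary function is $2c\,I(\mu_i < F) - c$ with $\mu_i$ uniform on $[-c, c]$, and the function name and seed are assumptions for this example.

```python
import random


def sample_averaged_binary_function(f_value, c, m, rng):
    """Average of m draws of g_i = 2c * 1[mu_i < F] - c with mu_i ~ U([-c, c]).

    f_value is the (fixed) value F(x, y) of the base discriminant function,
    assumed to lie in [-c, c].
    """
    total = 0.0
    for _ in range(m):
        mu_i = rng.uniform(-c, c)
        total += c if mu_i < f_value else -c
    return total / m


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (10, 1000, 100000):
        # The average concentrates around F(x, y) = 0.3 as m grows.
        print(m, sample_averaged_binary_function(0.3, 1.0, m, rng))
```

Each draw takes only the values $-c$ and $+c$, yet its expectation equals the real-valued $F(x, y)$, so the average of m draws is a bounded estimate to which Hoeffding's inequality applies.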

For any two events A and B, we have the inequality,

$$\Pr(A) = \Pr(A, B) + \Pr(A, \bar{B}) \le \Pr(B) + \Pr(\bar{B}\,|\,A).$$

Thus, for any $\gamma > 0$ we have

$$\Pr_Q[M(h, x, y) \le 0] \le \Pr_Q[M(f, x, y) \le \tfrac{\gamma}{2}] + \Pr_Q[M(f, x, y) > \tfrac{\gamma}{2}\,|\,M(h, x, y) \le 0]. \qquad (14)$$

Fix h, x, and y, and let $y'$ achieve the margin in (13). Then, we get

$$M(h, x, y) = h(x, y) - h(x, y'), \quad\text{and}\quad M(f, x, y) \le f(x, y) - f(x, y').$$

With these two results, since $\langle f(x, y) - f(x, y') \rangle_{f \sim p^m} = h(x, y) - h(x, y')$, we can get

$$\Pr_Q[M(f, x, y) > \tfrac{\gamma}{2}\,|\,M(h, x, y) \le 0] \le \Pr_Q[f(x, y) - f(x, y') > \tfrac{\gamma}{2}\,|\,M(h, x, y) \le 0]$$

$$\le \Pr_Q[f(x, y) - f(x, y') - M(h, x, y) > \tfrac{\gamma}{2}]$$

$$\le e^{-\frac{m\gamma^2}{32c^2}}, \qquad (15)$$

where the first two inequalities are due to the fact that if two events $A \subseteq B$, then $p(A) \le p(B)$,
and the last inequality is due to Hoeffding's inequality.
Substituting (15) into (14), we get,

$$\Pr_Q[M(h, x, y) \le 0] \le \Pr_Q[M(f, x, y) \le \tfrac{\gamma}{2}] + e^{-\frac{m\gamma^2}{32c^2}},$$

of which the left-hand side does not depend on f. We take the expectation over $f \sim p^m$ on
both sides and get,

$$\Pr_Q[M(h, x, y) \le 0] \le \big\langle \Pr_Q[M(f, x, y) \le \tfrac{\gamma}{2}] \big\rangle_{f \sim p^m} + e^{-\frac{m\gamma^2}{32c^2}}. \qquad (16)$$

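Let $p_0^m$ be a prior distribution on $\mathcal{H}_m$. $p_0^m$ is constructed from $p_0$ over $\mathcal{H}$ exactly as
$p^m$ is constructed from p. Then, $KL(p^m\|p_0^m) = m\,KL(p\|p_0)$. By the PAC-Bayes theorem
(McAllester, 1999), with probability at least $1 - \delta$ over sample $\mathcal{D}$, the following bound holds
for any distribution p,

$$\big\langle \Pr_Q[M(f, x, y) \le \tfrac{\gamma}{2}] \big\rangle_{f \sim p^m} \le \big\langle \Pr_{\mathcal{D}}[M(f, x, y) \le \tfrac{\gamma}{2}] \big\rangle_{f \sim p^m} + \sqrt{\frac{m\,KL(p\|p_0) + \ln N + \ln\delta^{-1} + 2}{2N - 1}}. \qquad (17)$$

By the similar statement as in (14), for every $f \in \mathcal{H}_m$ we have,

$$\Pr_{\mathcal{D}}[M(f, x, y) \le \tfrac{\gamma}{2}] \le \Pr_{\mathcal{D}}[M(h, x, y) \le \gamma] + \Pr_{\mathcal{D}}[M(f, x, y) \le \tfrac{\gamma}{2}\,|\,M(h, x, y) > \gamma]. \qquad (18)$$

Rewriting the second term on the right-hand side of (18), we get

$$\Pr_{\mathcal{D}}[M(f, x, y) \le \tfrac{\gamma}{2}\,|\,M(h, x, y) > \gamma] = \Pr_{\mathcal{D}}[\exists y' \ne y : \Delta f(x, y') \le \tfrac{\gamma}{2}\,|\,\forall y' \ne y : \Delta h(x, y') > \gamma]$$

$$\le \Pr_{\mathcal{D}}[\exists y' \ne y : \Delta f(x, y') \le \tfrac{\gamma}{2}\,|\,\Delta h(x, y') > \gamma]$$

$$\le \sum_{y' \ne y} \Pr_{\mathcal{D}}[\Delta f(x, y') \le \tfrac{\gamma}{2}\,|\,\Delta h(x, y') > \gamma]$$

$$\le (|\mathcal{Y}| - 1)\, e^{-\frac{m\gamma^2}{32c^2}}, \qquad (19)$$

where we use $\Delta f(x, y')$ to denote $f(x, y) - f(x, y')$, and use $\Delta h(x, y')$ to denote
$h(x, y) - h(x, y')$.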

Putting (16), (17), (18), and (19) together, we get the following bound holding for any
fixed m and $\gamma > 0$,

$$\Pr_Q[M(h, x, y) \le 0] \le \Pr_{\mathcal{D}}[M(h, x, y) \le \gamma] + |\mathcal{Y}|\, e^{-\frac{m\gamma^2}{32c^2}} + \sqrt{\frac{m\,KL(p\|p_0) + \ln N + \ln\delta^{-1} + 2}{2N - 1}}.$$

To finish the proof, we need to remove the dependence on m and $\gamma$. This can be done
by applying the union bound. By the definition of f, it is obvious that if $f \in \mathcal{H}_m$ then
$f(x, y) \in \{(2k - m)c/m : k = 0, 1, \ldots, m\}$. Thus, even though $\gamma$ can be any positive value,
there are no more than $m + 1$ events of the form $\{M(f, x, y) \le \gamma/2\}$. Since only the
application of the PAC-Bayes theorem in (17) depends on $(m, \gamma)$ and all the other steps are true
with probability one, we just need to consider the union of countably many events. Let
$\delta_{m,k} = \delta/(m(m + 1)^2)$, then the union of all the possible events has a probability at most
$\sum_{m,k} \delta_{m,k} = \sum_m (m + 1)\,\delta/(m(m + 1)^2) = \delta$. Therefore, with probability at least $1 - \delta$ over
random samples of $\mathcal{D}$, the following bound holds for all m and all $\gamma > 0$,

$$\Pr_Q[M(h, x, y) \le 0] - \Pr_{\mathcal{D}}[M(h, x, y) \le \gamma] \le |\mathcal{Y}|\, e^{-\frac{m\gamma^2}{32c^2}} + \sqrt{\frac{m\,KL(p\|p_0) + \ln N + \ln\delta_{m,k}^{-1} + 2}{2N - 1}}$$

$$\le |\mathcal{Y}|\, e^{-\frac{m\gamma^2}{32c^2}} + \sqrt{\frac{m\,KL(p\|p_0) + \ln N + 3\ln\frac{m + 1}{\delta} + 2}{2N - 1}}.$$

Setting $m = \Big\lceil 16c^2\gamma^{-2}\ln\frac{N|\mathcal{Y}|^2}{KL(p\|p_0) + 1} \Big\rceil$ gives the results in the theorem.
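Two small steps in the union-bound argument can be spelled out. The following is a worked sketch using only the definition $\delta_{m,k} = \delta / (m(m+1)^2)$ with $k = 0, 1, \ldots, m$ and $\delta \in (0, 1)$:

```latex
% The failure probabilities sum to delta by a telescoping series.
\sum_{m=1}^{\infty} \sum_{k=0}^{m} \frac{\delta}{m(m+1)^2}
  = \delta \sum_{m=1}^{\infty} \frac{m+1}{m(m+1)^2}
  = \delta \sum_{m=1}^{\infty} \Big( \frac{1}{m} - \frac{1}{m+1} \Big)
  = \delta .

% The confidence term is bounded using m < m + 1 and delta < 1.
\ln \delta_{m,k}^{-1}
  = \ln \frac{m(m+1)^2}{\delta}
  \le \ln \frac{(m+1)^3}{\delta^3}
  = 3 \ln \frac{m+1}{\delta} .
```

The first identity is what allows the bound to hold simultaneously for all m and all thresholds, and the second explains the factor of 3 in front of the logarithm in the final inequality.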